Computer Science
See recent articles
- [1] arXiv:2406.12876 [pdf, other]
-
Title: Controlling Chaos Using Edge Computing HardwareComments: 28 pages, 11 figuresJournal-ref: Nat Commun 15, 3886 (2024)Subjects: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
Machine learning provides a data-driven approach for creating a digital twin of a system - a digital model used to predict the system behavior. Having an accurate digital twin can drive many applications, such as controlling autonomous systems. Often the size, weight, and power consumption of the digital twin or related controller must be minimized, ideally realized on embedded computing hardware that can operate without a cloud-computing connection. Here, we show that a nonlinear controller based on next-generation reservoir computing can tackle a difficult control problem: controlling a chaotic system to an arbitrary time-dependent state. The model is accurate, yet it is small enough to be evaluated on a field-programmable gate array typically found in embedded devices. Furthermore, the model only requires 25.0 $\pm$ 7.0 nJ per evaluation, well below other algorithms, even without systematic power optimization. Our work represents the first step in deploying efficient machine learning algorithms to the computing "edge."
- [2] arXiv:2406.12896 [pdf, html, other]
-
Title: Leveraging Pedagogical Theories to Understand Student Learning Process with Graph-based Reasonable Knowledge TracingComments: Preprint, accepted to appear in SIGKDD 2024, 12 pages. The source code is available at this https URL. Keywords: interpretable knowledge tracing, student behavior modeling, intelligence educationSubjects: Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)
Knowledge tracing (KT) is a crucial task in intelligent education, focusing on predicting students' performance on given questions to trace their evolving knowledge. The advancement of deep learning in this field has led to deep-learning knowledge tracing (DLKT) models that prioritize high predictive accuracy. However, many existing DLKT methods overlook the fundamental goal of tracking students' dynamical knowledge mastery. These models do not explicitly model knowledge mastery tracing processes or yield unreasonable results that educators find difficulty to comprehend and apply in real teaching scenarios. In response, our research conducts a preliminary analysis of mainstream KT approaches to highlight and explain such unreasonableness. We introduce GRKT, a graph-based reasonable knowledge tracing method to address these issues. By leveraging graph neural networks, our approach delves into the mutual influences of knowledge concepts, offering a more accurate representation of how the knowledge mastery evolves throughout the learning process. Additionally, we propose a fine-grained and psychological three-stage modeling process as knowledge retrieval, memory strengthening, and knowledge learning/forgetting, to conduct a more reasonable knowledge tracing process. Comprehensive experiments demonstrate that GRKT outperforms eleven baselines across three datasets, not only enhancing predictive accuracy but also generating more reasonable knowledge tracing results. This makes our model a promising advancement for practical implementation in educational settings. The source code is available at this https URL.
- [3] arXiv:2406.12897 [pdf, html, other]
-
Title: Advancing Histopathology-Based Breast Cancer Diagnosis: Insights into Multi-Modality and ExplainabilityComments: 31 pages including referencesSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
It is imperative that breast cancer is detected precisely and timely to improve patient outcomes. Diagnostic methodologies have traditionally relied on unimodal approaches; however, medical data analytics is integrating diverse data sources beyond conventional imaging. Using multi-modal techniques, integrating both image and non-image data, marks a transformative advancement in breast cancer diagnosis. The purpose of this review is to explore the burgeoning field of multimodal techniques, particularly the fusion of histopathology images with non-image data. Further, Explainable AI (XAI) will be used to elucidate the decision-making processes of complex algorithms, emphasizing the necessity of explainability in diagnostic processes. This review utilizes multi-modal data and emphasizes explainability to enhance diagnostic accuracy, clinician confidence, and patient engagement, ultimately fostering more personalized treatment strategies for breast cancer, while also identifying research gaps in multi-modality and explainability, guiding future studies, and contributing to the strategic direction of the field.
- [4] arXiv:2406.12900 [pdf, html, other]
-
Title: Factor Graph Optimization of Error-Correcting Codes for Belief Propagation DecodingSubjects: Information Theory (cs.IT); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
The design of optimal linear block codes capable of being efficiently decoded is of major concern, especially for short block lengths. As near capacity-approaching codes, Low-Density Parity-Check (LDPC) codes possess several advantages over other families of codes, the most notable being its efficient decoding via Belief Propagation. While many LDPC code design methods exist, the development of efficient sparse codes that meet the constraints of modern short code lengths and accommodate new channel models remains a challenge. In this work, we propose for the first time a data-driven approach for the design of sparse codes. We develop locally optimal codes with respect to Belief Propagation decoding via the learning on the Factor graph (also called the Tanner graph) under channel noise simulations. This is performed via a novel tensor representation of the Belief Propagation algorithm, optimized over finite fields via backpropagation coupled with an efficient line-search method. The proposed approach is shown to outperform the decoding performance of existing popular codes by orders of magnitude and demonstrates the power of data-driven approaches for code design.
- [5] arXiv:2406.12902 [pdf, html, other]
-
Title: Can AI Beat Undergraduates in Entry-level Java Assignments? Benchmarking Large Language Models on JavaBenchSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Programming Languages (cs.PL); Software Engineering (cs.SE)
Code generation benchmarks such as HumanEval are widely adopted to evaluate LLMs' capabilities. However, after consolidating the latest 24 benchmarks, we noticed three significant imbalances. First, imbalanced programming language. 95.8% of benchmarks involve Python, while only 5 benchmarks involve Java. Second, imbalanced code granularity. Function-/statement-level benchmarks account for over 83.3% of benchmarks. Only a mere handful extends to class-/project-levels, and all are limited to Python. Third, lacking advanced features. Existing benchmarks primarily assess basic coding skills, while overlooking advanced Object-Oriented Programming (OOP) features (i.e., encapsulation, inheritance, and polymorphism).
To fill these gaps, we propose JavaBench, a project-level Java benchmark that exercises OOP features. It comprises four Java projects with 389 methods in 106 Java classes. The test coverage is up to 92%, and JavaBench is attested by 282 undergraduate students, reaching a 90.93/100 average score (i.e., pass rate against the test suite), ensuring the quality of documentation, code skeleton, and tests. To better evaluate LLM's capability against JavaBench, we introduce a systematic evaluation design covering three context settings and five synthesis strategies at two granularities using three hierarchical metrics. Our extensive experiment yields several interesting findings. First, we noticed that regarding project-level Java programming, LLMs are far behind undergraduate students (no project can be correctly completed by any studied LLMs, and at most 41.17% Pass@5 in a more relaxed evaluation). Second, using method signature as prompt context may strike an ideal balance for project-level code generation. JavaBench is publicly available at this https URL. - [6] arXiv:2406.12904 [pdf, html, other]
-
Title: Meent: Differentiable Electromagnetic Simulator for Machine LearningYongha Kim, Anthony W. Jung, Sanmun Kim, Kevin Octavian, Doyoung Heo, Chaejin Park, Jeongmin Shin, Sunghyun Nam, Chanhyung Park, Juho Park, Sangjun Han, Jinmyoung Lee, Seolho Kim, Min Seok Jang, Chan Y. ParkComments: under reviewSubjects: Machine Learning (cs.LG); Computational Physics (physics.comp-ph); Optics (physics.optics)
Electromagnetic (EM) simulation plays a crucial role in analyzing and designing devices with sub-wavelength scale structures such as solar cells, semiconductor devices, image sensors, future displays and integrated photonic devices. Specifically, optics problems such as estimating semiconductor device structures and designing nanophotonic devices provide intriguing research topics with far-reaching real world impact. Traditional algorithms for such tasks require iteratively refining parameters through simulations, which often yield sub-optimal results due to the high computational cost of both the algorithms and EM simulations. Machine learning (ML) emerged as a promising candidate to mitigate these challenges, and optics research community has increasingly adopted ML algorithms to obtain results surpassing classical methods across various tasks. To foster a synergistic collaboration between the optics and ML communities, it is essential to have an EM simulation software that is user-friendly for both research communities. To this end, we present Meent, an EM simulation software that employs rigorous coupled-wave analysis (RCWA). Developed in Python and equipped with automatic differentiation (AD) capabilities, Meent serves as a versatile platform for integrating ML into optics research and vice versa. To demonstrate its utility as a research platform, we present three applications of Meent: 1) generating a dataset for training neural operator, 2) serving as an environment for the reinforcement learning of nanophotonic device optimization, and 3) providing a solution for inverse problems with gradient-based optimizers. These applications highlight Meent's potential to advance both EM simulation and ML methodologies. The code is available at this https URL with the MIT license to promote the cross-polinations of ideas among academic researchers and industry practitioners.
- [7] arXiv:2406.12905 [pdf, html, other]
-
Title: PufferLib: Making Reinforcement Learning Libraries and Environments Play NiceSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
You have an environment, a model, and a reinforcement learning library that are designed to work together but don't. PufferLib makes them play nice. The library provides one-line environment wrappers that eliminate common compatibility problems and fast vectorization to accelerate training. With PufferLib, you can use familiar libraries like CleanRL and SB3 to scale from classic benchmarks like Atari and Procgen to complex simulators like NetHack and Neural MMO. We release pip packages and prebuilt images with dependencies for dozens of environments. All of our code is free and open-source software under the MIT license, complete with baselines, documentation, and support at this http URL.
- [8] arXiv:2406.12907 [pdf, html, other]
-
Title: Reconciling Kaplan and Chinchilla Scaling LawsSubjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Kaplan et al. [2020] (`Kaplan') and Hoffmann et al. [2022] (`Chinchilla') studied the scaling behavior of transformers trained on next-token language prediction. These studies produced different estimates for how the number of parameters ($N$) and training tokens ($D$) should be set to achieve the lowest possible loss for a given compute budget ($C$). Kaplan: $N_\text{optimal} \propto C^{0.73}$, Chinchilla: $N_\text{optimal} \propto C^{0.50}$. This note finds that much of this discrepancy can be attributed to Kaplan counting non-embedding rather than total parameters, combined with their analysis being performed at small scale. Simulating the Chinchilla study under these conditions produces biased scaling coefficients close to Kaplan's. Hence, this note reaffirms Chinchilla's scaling coefficients, by explaining the cause of Kaplan's original overestimation.
- [9] arXiv:2406.12908 [pdf, html, other]
-
Title: Rating Multi-Modal Time-Series Forecasting Models (MM-TSFM) for Robustness Through a Causal LensKausik Lakkaraju, Rachneet Kaur, Zhen Zeng, Parisa Zehtabi, Sunandita Patra, Biplav Srivastava, Marco ValtortaSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Methodology (stat.ME); Machine Learning (stat.ML)
AI systems are notorious for their fragility; minor input changes can potentially cause major output swings. When such systems are deployed in critical areas like finance, the consequences of their uncertain behavior could be severe. In this paper, we focus on multi-modal time-series forecasting, where imprecision due to noisy or incorrect data can lead to erroneous predictions, impacting stakeholders such as analysts, investors, and traders. Recently, it has been shown that beyond numeric data, graphical transformations can be used with advanced visual models to achieve better performance. In this context, we introduce a rating methodology to assess the robustness of Multi-Modal Time-Series Forecasting Models (MM-TSFM) through causal analysis, which helps us understand and quantify the isolated impact of various attributes on the forecasting accuracy of MM-TSFM. We apply our novel rating method on a variety of numeric and multi-modal forecasting models in a large experimental setup (six input settings of control and perturbations, ten data distributions, time series from six leading stocks in three industries over a year of data, and five time-series forecasters) to draw insights on robust forecasting models and the context of their strengths. Within the scope of our study, our main result is that multi-modal (numeric + visual) forecasting, which was found to be more accurate than numeric forecasting in previous studies, can also be more robust in diverse settings. Our work will help different stakeholders of time-series forecasting understand the models` behaviors along trust (robustness) and accuracy dimensions to select an appropriate model for forecasting using our rating method, leading to improved decision-making.
- [10] arXiv:2406.12909 [pdf, html, other]
-
Title: Scalable Training of Graph Foundation Models for Atomistic Materials Modeling: A Case Study with HydraGNNMassimiliano Lupo Pasini, Jong Youl Choi, Kshitij Mehta, Pei Zhang, David Rogers, Jonghyun Bae, Khaled Z. Ibrahim, Ashwin M. Aji, Karl W. Schulz, Jorda Polo, Prasanna BalaprakashComments: 16 pages, 13 figuresSubjects: Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
We present our work on developing and training scalable graph foundation models (GFM) using HydraGNN, a multi-headed graph convolutional neural network architecture. HydraGNN expands the boundaries of graph neural network (GNN) in both training scale and data diversity. It abstracts over message passing algorithms, allowing both reproduction of and comparison across algorithmic innovations that define convolution in GNNs. This work discusses a series of optimizations that have allowed scaling up the GFM training to tens of thousands of GPUs on datasets that consist of hundreds of millions of graphs. Our GFMs use multi-task learning (MTL) to simultaneously learn graph-level and node-level properties of atomistic structures, such as the total energy and atomic forces. Using over 150 million atomistic structures for training, we illustrate the performance of our approach along with the lessons learned on two United States Department of Energy (US-DOE) supercomputers, namely the Perlmutter petascale system at the National Energy Research Scientific Computing Center and the Frontier exascale system at Oak Ridge National Laboratory. The HydraGNN architecture enables the GFM to achieve near-linear strong scaling performance using more than 2,000 GPUs on Perlmutter and 16,000 GPUs on Frontier. Hyperparameter optimization (HPO) was performed on over 64,000 GPUs on Frontier to select GFM architectures with high accuracy. Early stopping was applied on each GFM architecture for energy awareness in performing such an extreme-scale task. The training of an ensemble of highest-ranked GFM architectures continued until convergence to establish uncertainty quantification (UQ) capabilities with ensemble learning. Our contribution opens the door for rapidly developing, training, and deploying GFMs using large-scale computational resources to enable AI-accelerated materials discovery and design.
- [11] arXiv:2406.12910 [pdf, other]
-
Title: Human-level molecular optimization driven by mol-gene evolutionJiebin Fang (1 and 2), Churu Mao (2), Yuchen Zhu (3), Xiaoming Chen (2), Chang-Yu Hsieh (3), Zhongjun Ma (1 and 2) ((1) Hainan Institute of Zhejiang University, (2) Institute of Marine Biology and Pharmacology, Ocean College, Zhejiang University, (3) College of Pharmaceutical Sciences and Cancer Center, Zhejiang University)Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE); Chemical Physics (physics.chem-ph); Biomolecules (q-bio.BM)
De novo molecule generation allows the search for more drug-like hits across a vast chemical space. However, lead optimization is still required, and the process of optimizing molecular structures faces the challenge of balancing structural novelty with pharmacological properties. This study introduces the Deep Genetic Molecular Modification Algorithm (DGMM), which brings structure modification to the level of medicinal chemists. A discrete variational autoencoder (D-VAE) is used in DGMM to encode molecules as quantization code, mol-gene, which incorporates deep learning into genetic algorithms for flexible structural optimization. The mol-gene allows for the discovery of pharmacologically similar but structurally distinct compounds, and reveals the trade-offs of structural optimization in drug discovery. We demonstrate the effectiveness of the DGMM in several applications.
- [12] arXiv:2406.12911 [pdf, html, other]
-
Title: The Promise of Analog Deep Learning: Recent Advances, Challenges and OpportunitiesSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
Much of the present-day Artificial Intelligence (AI) utilizes artificial neural networks, which are sophisticated computational models designed to recognize patterns and solve complex problems by learning from data. However, a major bottleneck occurs during a device's calculation of weighted sums for forward propagation and optimization procedure for backpropagation, especially for deep neural networks, or networks with numerous layers. Exploration into different methods of implementing neural networks is necessary for further advancement of the area. While a great deal of research into AI hardware in both directions, analog and digital implementation widely exists, much of the existing survey works lacks discussion on the progress of analog deep learning. To this end, we attempt to evaluate and specify the advantages and disadvantages, along with the current progress with regards to deep learning, for analog implementations. In this paper, our focus lies on the comprehensive examination of eight distinct analog deep learning methodologies across multiple key parameters. These parameters include attained accuracy levels, application domains, algorithmic advancements, computational speed, and considerations of energy efficiency and power consumption. We also identify the neural network-based experiments implemented using these hardware devices and discuss comparative performance achieved by the different analog deep learning methods along with an analysis of their current limitations. Overall, we find that Analog Deep Learning has great potential for future consumer-level applications, but there is still a long road ahead in terms of scalability. Most of the current implementations are more proof of concept and are not yet practically deployable for large-scale models.
- [13] arXiv:2406.12913 [pdf, html, other]
-
Title: T-JEPA: A Joint-Embedding Predictive Architecture for Trajectory Similarity ComputationSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Trajectory similarity computation is an essential technique for analyzing moving patterns of spatial data across various applications such as traffic management, wildlife tracking, and location-based services. Modern methods often apply deep learning techniques to approximate heuristic metrics but struggle to learn more robust and generalized representations from the vast amounts of unlabeled trajectory data. Recent approaches focus on self-supervised learning methods such as contrastive learning, which have made significant advancements in trajectory representation learning. However, contrastive learning-based methods heavily depend on manually pre-defined data augmentation schemes, limiting the diversity of generated trajectories and resulting in learning from such variations in 2D Euclidean space, which prevents capturing high-level semantic variations. To address these limitations, we propose T-JEPA, a self-supervised trajectory similarity computation method employing Joint-Embedding Predictive Architecture (JEPA) to enhance trajectory representation learning. T-JEPA samples and predicts trajectory information in representation space, enabling the model to infer the missing components of trajectories at high-level semantics without relying on domain knowledge or manual effort. Extensive experiments conducted on three urban trajectory datasets and two Foursquare datasets demonstrate the effectiveness of T-JEPA in trajectory similarity computation.
- [14] arXiv:2406.12914 [pdf, html, other]
-
Title: The Significance of Latent Data Divergence in Predicting System DegradationComments: 16 pagesSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Condition-Based Maintenance is pivotal in enabling the early detection of potential failures in engineering systems, where precise prediction of the Remaining Useful Life is essential for effective maintenance and operation. However, a predominant focus in the field centers on predicting the Remaining Useful Life using unprocessed or minimally processed data, frequently neglecting the intricate dynamics inherent in the dataset. In this work we introduce a novel methodology grounded in the analysis of statistical similarity within latent data from system components. Leveraging a specifically designed architecture based on a Vector Quantized Variational Autoencoder, we create a sequence of discrete vectors which is used to estimate system-specific priors. We infer the similarity between systems by evaluating the divergence of these priors, offering a nuanced understanding of individual system behaviors. The efficacy of our approach is demonstrated through experiments on the NASA commercial modular aero-propulsion system simulation (C-MAPSS) dataset. Our validation not only underscores the potential of our method in advancing the study of latent statistical divergence but also demonstrates its superiority over existing techniques.
- [15] arXiv:2406.12915 [pdf, html, other]
-
Title: GROD: Enhancing Generalization of Transformer with Out-of-Distribution DetectionSubjects: Machine Learning (cs.LG); Probability (math.PR)
Transformer networks excel in natural language processing (NLP) and computer vision (CV) tasks. However, they face challenges in generalizing to Out-of-Distribution (OOD) datasets, that is, data whose distribution differs from that seen during training. The OOD detection aims to distinguish data that deviates from the expected distribution, while maintaining optimal performance on in-distribution (ID) data. This paper introduces a novel approach based on OOD detection, termed the Generate Rounded OOD Data (GROD) algorithm, which significantly bolsters the generalization performance of transformer networks across various tasks. GROD is motivated by our new OOD detection Probably Approximately Correct (PAC) Theory for transformer. The transformer has learnability in terms of OOD detection that is, when the data is sufficient the outlier can be well represented. By penalizing the misclassification of OOD data within the loss function and generating synthetic outliers, GROD guarantees learnability and refines the decision boundaries between inlier and outlier. This strategy demonstrates robust adaptability and general applicability across different data types. Evaluated across diverse OOD detection tasks in NLP and CV, GROD achieves SOTA regardless of data format. On average, it reduces the SOTA FPR@95 from 21.97% to 0.12%, and improves AUROC from 93.62% to 99.98% on image classification tasks, and the SOTA FPR@95 by 12.89% and AUROC by 2.27% in detecting semantic text outliers. The code is available at https://anonymous.4open.science/r/GROD-OOD-Detection-with-transformers-B70F.
- [16] arXiv:2406.12916 [pdf, html, other]
-
Title: Opening the Black Box: predicting the trainability of deep neural networks with reconstruction entropyComments: 22 pages, 5 figures, 1 tableSubjects: Machine Learning (cs.LG); Disordered Systems and Neural Networks (cond-mat.dis-nn); High Energy Physics - Theory (hep-th); Machine Learning (stat.ML)
An important challenge in machine learning is to predict the initial conditions under which a given neural network will be trainable. We present a method for predicting the trainable regime in parameter space for deep feedforward neural networks, based on reconstructing the input from subsequent activation layers via a cascade of single-layer auxiliary networks. For both MNIST and CIFAR10, we show that a single epoch of training of the shallow cascade networks is sufficient to predict the trainability of the deep feedforward network, thereby providing a significant reduction in overall training time. We achieve this by computing the relative entropy between reconstructed images and the original inputs, and show that this probe of information loss is sensitive to the phase behaviour of the network. Our results provide a concrete link between the flow of information and the trainability of deep neural networks, further elucidating the role of criticality in these systems.
- [17] arXiv:2406.12918 [pdf, html, other]
-
Title: Brain-Inspired Spike Echo State Network Dynamics for Aero-Engine Intelligent Fault PredictionSubjects: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
Aero-engine fault prediction aims to accurately predict the development trend of the future state of aero-engines, so as to diagnose faults in advance. Traditional aero-engine parameter prediction methods mainly use the nonlinear mapping relationship of time series data but generally ignore the adequate spatiotemporal features contained in aero-engine data. To this end, we propose a brain-inspired spike echo state network (Spike-ESN) model for aero-engine intelligent fault prediction, which is used to effectively capture the evolution process of aero-engine time series data in the framework of spatiotemporal dynamics. In the proposed approach, we design a spike input layer based on Poisson distribution inspired by the spike neural encoding mechanism of biological neurons, which can extract the useful temporal characteristics in aero-engine sequence data. Then, the temporal characteristics are input into a spike reservoir through the current calculation method of spike accumulation in neurons, which projects the data into a high-dimensional sparse space. In addition, we use the ridge regression method to read out the internal state of the spike reservoir. Finally, the experimental results of aero-engine states prediction demonstrate the superiority and potential of the proposed method.
- [18] arXiv:2406.12919 [pdf, html, other]
-
Title: Understanding active learning of molecular docking and its applicationsSubjects: Machine Learning (cs.LG); Chemical Physics (physics.chem-ph); Biomolecules (q-bio.BM)
With the advancing capabilities of computational methodologies and resources, ultra-large-scale virtual screening via molecular docking has emerged as a prominent strategy for in silico hit discovery. Given the exhaustive nature of ultra-large-scale virtual screening, active learning methodologies have garnered attention as a means to mitigate computational cost through iterative small-scale docking and machine learning model training. While the efficacy of active learning methodologies has been empirically validated in extant literature, a critical investigation remains in how surrogate models can predict docking score without considering three-dimensional structural features, such as receptor conformation and binding poses. In this paper, we thus investigate how active learning methodologies effectively predict docking scores using only 2D structures and under what circumstances they may work particularly well through benchmark studies encompassing six receptor targets. Our findings suggest that surrogate models tend to memorize structural patterns prevalent in high docking scored compounds obtained during acquisition steps. Despite this tendency, surrogate models demonstrate utility in virtual screening, as exemplified in the identification of actives from DUD-E dataset and high docking-scored compounds from EnamineReal library, a significantly larger set than the initial screening pool. Our comprehensive analysis underscores the reliability and potential applicability of active learning methodologies in virtual screening campaigns.
- [19] arXiv:2406.12921 [pdf, html, other]
-
Title: WindowMixer: Intra-Window and Inter-Window Modeling for Time Series ForecastingSubjects: Machine Learning (cs.LG)
Time series forecasting (TSF) is crucial in fields like economic forecasting, weather prediction, traffic flow analysis, and public health surveillance. Real-world time series data often include noise, outliers, and missing values, making accurate forecasting challenging. Traditional methods model point-to-point relationships, which limits their ability to capture complex temporal patterns and increases their susceptibility to this http URL address these issues, we introduce the WindowMixer model, built on an all-MLP framework. WindowMixer leverages the continuous nature of time series by examining temporal variations from a window-based perspective. It decomposes time series into trend and seasonal components, handling them individually. For trends, a fully connected (FC) layer makes predictions. For seasonal components, time windows are projected to produce window tokens, processed by Intra-Window-Mixer and Inter-Window-Mixer modules. The Intra-Window-Mixer models relationships within each window, while the Inter-Window-Mixer models relationships between windows. This approach captures intricate patterns and long-range dependencies in the data.Experiments show WindowMixer consistently outperforms existing methods in both long-term and short-term forecasting tasks.
- [20] arXiv:2406.12923 [pdf, html, other]
-
Title: Interpretable Cascading Mixture-of-Experts for Urban Traffic Congestion PredictionSubjects: Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Rapid urbanization has significantly escalated traffic congestion, underscoring the need for advanced congestion prediction services to bolster intelligent transportation systems. As one of the world's largest ride-hailing platforms, DiDi places great emphasis on the accuracy of congestion prediction to enhance the effectiveness and reliability of their real-time services, such as travel time estimation and route planning. Despite numerous efforts have been made on congestion prediction, most of them fall short in handling heterogeneous and dynamic spatio-temporal dependencies (e.g., periodic and non-periodic congestions), particularly in the presence of noisy and incomplete traffic data. In this paper, we introduce a Congestion Prediction Mixture-of-Experts, CP-MoE, to address the above challenges. We first propose a sparsely-gated Mixture of Adaptive Graph Learners (MAGLs) with congestion-aware inductive biases to improve the model capacity for efficiently capturing complex spatio-temporal dependencies in varying traffic scenarios. Then, we devise two specialized experts to help identify stable trends and periodic patterns within the traffic data, respectively. By cascading these experts with MAGLs, CP-MoE delivers congestion predictions in a more robust and interpretable manner. Furthermore, an ordinal regression strategy is adopted to facilitate effective collaboration among diverse experts. Extensive experiments on real-world datasets demonstrate the superiority of our proposed method compared with state-of-the-art spatio-temporal prediction models. More importantly, CP-MoE has been deployed in DiDi to improve the accuracy and reliability of the travel time estimation system.
- [21] arXiv:2406.12925 [pdf, html, other]
-
Title: GLiNER multi-task: Generalist Lightweight Model for Various Information Extraction TasksComments: 11 pages, 1 figure, 6 tablesSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)
Information extraction tasks require both accurate, efficient, and generalisable models. Classical supervised deep learning approaches can achieve the required performance, but they need large datasets and are limited in their ability to adapt to different tasks. On the other hand, large language models (LLMs) demonstrate good generalization, meaning that they can adapt to many different tasks based on user requests. However, LLMs are computationally expensive and tend to fail to generate structured outputs. In this article, we will introduce a new kind of GLiNER model that can be used for various information extraction tasks while being a small encoder model. Our model achieved SoTA performance on zero-shot NER benchmarks and leading performance on question-answering, summarization and relation extraction tasks. Additionally, in this article, we will cover experimental results on self-learning approaches for named entity recognition using GLiNER models.
- [22] arXiv:2406.12928 [pdf, html, other]
-
Title: Evaluating the Generalization Ability of Quantized LLMs: Benchmark, Analysis, and ToolboxYijun Liu, Yuan Meng, Fang Wu, Shenhao Peng, Hang Yao, Chaoyu Guan, Chen Tang, Xinzhu Ma, Zhi Wang, Wenwu ZhuSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Large language models (LLMs) have exhibited exciting progress in multiple scenarios, while the huge computational demands hinder their deployments in lots of real-world applications. As an effective means to reduce memory footprint and inference cost, quantization also faces challenges in performance degradation at low bit-widths. Understanding the impact of quantization on LLM capabilities, especially the generalization ability, is crucial. However, the community's main focus remains on the algorithms and models of quantization, with insufficient attention given to whether the quantized models can retain the strong generalization abilities of LLMs. In this work, we fill this gap by providing a comprehensive benchmark suite for this research topic, including an evaluation system, detailed analyses, and a general toolbox. Specifically, based on the dominant pipeline in LLM quantization, we primarily explore the impact of calibration data distribution on the generalization of quantized LLMs and conduct the benchmark using more than 40 datasets within two main scenarios. Based on this benchmark, we conduct extensive experiments with two well-known LLMs (English and Chinese) and four quantization algorithms to investigate this topic in-depth, yielding several counter-intuitive and valuable findings, e.g., models quantized using a calibration set with the same distribution as the test data are not necessarily optimal. Besides, to facilitate future research, we also release a modular-designed toolbox, which decouples the overall pipeline into several separate components, e.g., base LLM module, dataset module, quantizer module, etc. and allows subsequent researchers to easily assemble their methods through a simple configuration. Our benchmark suite is publicly available at this https URL
- [23] arXiv:2406.12929 [pdf, html, other]
-
Title: RMF: A Risk Measurement Framework for Machine Learning ModelsComments: Accepted at CSA@ARES 2024Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Machine learning (ML) models are used in many safety- and security-critical applications nowadays. It is therefore important to measure the security of a system that uses ML as a component. This paper focuses on the field of ML, particularly the security of autonomous vehicles. For this purpose, a technical framework will be described, implemented, and evaluated in a case study. Based on ISO/IEC 27004:2016, risk indicators are utilized to measure and evaluate the extent of damage and the effort required by an attacker. It is not possible, however, to determine a single risk value that represents the attacker's effort. Therefore, four different values must be interpreted individually.
- [24] arXiv:2406.12930 [pdf, html, other]
-
Title: Tender: Accelerating Large Language Models via Tensor Decomposition and Runtime RequantizationComments: To appear at the 51st International Symposium on Computer Architecture (ISCA 2024)Subjects: Machine Learning (cs.LG); Hardware Architecture (cs.AR)
Large language models (LLMs) demonstrate outstanding performance in various tasks in machine learning and have thus become one of the most important workloads in today's computing landscape. However, deploying LLM inference poses challenges due to the high compute and memory requirements stemming from the enormous model size and the difficulty of running it in the integer pipelines. In this paper, we present Tender, an algorithm-hardware co-design solution that enables efficient deployment of LLM inference at low precision. Based on our analysis of outlier values in LLMs, we propose a decomposed quantization technique in which the scale factors of decomposed matrices are powers of two apart. The proposed scheme allows us to avoid explicit requantization (i.e., dequantization/quantization) when accumulating the partial sums from the decomposed matrices, with a minimal extension to the commodity tensor compute hardware. Our evaluation shows that Tender achieves higher accuracy and inference performance compared to the state-of-the-art methods while also being significantly less intrusive to the existing accelerators.
- [25] arXiv:2406.12934 [pdf, html, other]
-
Title: Current state of LLM Risks and AI GuardrailsComments: Independent study, Exploring LLMs, Deploying LLMs and their RisksSubjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Large language models (LLMs) have become increasingly sophisticated, leading to widespread deployment in sensitive applications where safety and reliability are paramount. However, LLMs have inherent risks accompanying them, including bias, potential for unsafe actions, dataset poisoning, lack of explainability, hallucinations, and non-reproducibility. These risks necessitate the development of "guardrails" to align LLMs with desired behaviors and mitigate potential harm.
This work explores the risks associated with deploying LLMs and evaluates current approaches to implementing guardrails and model alignment techniques. We examine intrinsic and extrinsic bias evaluation methods and discuss the importance of fairness metrics for responsible AI development. The safety and reliability of agentic LLMs (those capable of real-world actions) are explored, emphasizing the need for testability, fail-safes, and situational awareness.
Technical strategies for securing LLMs are presented, including a layered protection model operating at external, secondary, and internal levels. System prompts, Retrieval-Augmented Generation (RAG) architectures, and techniques to minimize bias and protect privacy are highlighted.
Effective guardrail design requires a deep understanding of the LLM's intended use case, relevant regulations, and ethical considerations. Striking a balance between competing requirements, such as accuracy and privacy, remains an ongoing challenge. This work underscores the importance of continuous research and development to ensure the safe and responsible use of LLMs in real-world applications. - [26] arXiv:2406.12935 [pdf, html, other]
-
Title: ChatBug: A Common Vulnerability of Aligned LLMs Induced by Chat TemplatesSubjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Large language models (LLMs) are expected to follow instructions from users and engage in conversations. Techniques to enhance LLMs' instruction-following capabilities typically fine-tune them using data structured according to a predefined chat template. Although chat templates are shown to be effective in optimizing LLM performance, their impact on safety alignment of LLMs has been less understood, which is crucial for deploying LLMs safely at scale.
In this paper, we investigate how chat templates affect safety alignment of LLMs. We identify a common vulnerability, named ChatBug, that is introduced by chat templates. Our key insight to identify ChatBug is that the chat templates provide a rigid format that need to be followed by LLMs, but not by users. Hence, a malicious user may not necessarily follow the chat template when prompting LLMs. Instead, malicious users could leverage their knowledge of the chat template and accordingly craft their prompts to bypass safety alignments of LLMs. We develop two attacks to exploit the ChatBug vulnerability. We demonstrate that a malicious user can exploit the ChatBug vulnerability of eight state-of-the-art (SOTA) LLMs and effectively elicit unintended responses from these models. Moreover, we show that ChatBug can be exploited by existing jailbreak attacks to enhance their attack success rates. We investigate potential countermeasures to ChatBug. Our results show that while adversarial training effectively mitigates the ChatBug vulnerability, the victim model incurs significant performance degradation. These results highlight the trade-off between safety alignment and helpfulness. Developing new methods for instruction tuning to balance this trade-off is an open and critical direction for future research - [27] arXiv:2406.12938 [pdf, other]
-
Title: Security in IS and social engineering -- an overview and state of the artFlorence Sèdes (UT3, IRIT, CNRS)Comments: in French language, INFORSID 2024, May 2024, Nancy, FranceSubjects: Cryptography and Security (cs.CR); Databases (cs.DB)
Major transformations related to information technologies affect InformationSystems (IS) that support the business processes of organizations and their actors. Deployment in a complex environment involving sensitive, massive and heterogeneous data generates risks with legal, social and financial impacts. This context of transition and openness makes the security of these IS central to the concerns of organizations. The digitization of all processes and the opening to IoT devices (Internet of Things) has fostered the emergence of a new formof crime, i.e. cybercrime.This generic term covers a number of malicious acts, the majority of which are now perpetrated using social engineering strategies, a phenomenon enabling a combined exploitation of ``human'' vulnerabilities and digital tools. The maliciousness of such attacks lies in the fact that they turn users into facilitators of cyber-attacks, to the point of being perceived as the ``weak link'' of this http URL deployment policies prove insufficient, it is necessary to think about upstream steps: knowing how to anticipate, identifying weak signals and outliers, detect early and react quickly to computer crime are therefore priority issues requiring a prevention and cooperation this http URL this overview, we propose a synthesis of literature and professional practices on this subject.
- [28] arXiv:2406.12944 [pdf, html, other]
-
Title: Semantic Graph Consistency: Going Beyond Patches for Regularizing Self-Supervised Vision TransformersSubjects: Computer Vision and Pattern Recognition (cs.CV)
Self-supervised learning (SSL) with vision transformers (ViTs) has proven effective for representation learning as demonstrated by the impressive performance on various downstream tasks. Despite these successes, existing ViT-based SSL architectures do not fully exploit the ViT backbone, particularly the patch tokens of the ViT. In this paper, we introduce a novel Semantic Graph Consistency (SGC) module to regularize ViT-based SSL methods and leverage patch tokens effectively. We reconceptualize images as graphs, with image patches as nodes and infuse relational inductive biases by explicit message passing using Graph Neural Networks into the SSL framework. Our SGC loss acts as a regularizer, leveraging the underexploited patch tokens of ViTs to construct a graph and enforcing consistency between graph features across multiple views of an image. Extensive experiments on various datasets including ImageNet, RESISC and Food-101 show that our approach significantly improves the quality of learned representations, resulting in a 5-10\% increase in performance when limited labeled data is used for linear evaluation. These experiments coupled with a comprehensive set of ablations demonstrate the promise of our approach in various settings.
- [29] arXiv:2406.12945 [pdf, other]
-
Title: Under the Hood of Tabular Data Generation Models: the Strong Impact of Hyperparameter TuningG. Charbel N. Kindji (IRISA, LACODAM), Lina Maria Rojas-Barahona, Elisa Fromont (IRISA, LACODAM), Tanguy UrvoySubjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
We investigate the impact of dataset-specific hyperparameter, feature encoding, and architecture tuning on five recent model families for tabular data generation through an extensive benchmark on 16 datasets. This study addresses the practical need for a unified evaluation of models that fully considers hyperparameter optimization. Additionally, we propose a reduced search space for each model that allows for quick optimization, achieving nearly equivalent performance at a significantly lower cost.Our benchmark demonstrates that, for most models, large-scale dataset-specific tuning substantially improves performance compared to the original configurations. Furthermore, we confirm that diffusion-based models generally outperform other models on tabular data. However, this advantage is not significant when the entire tuning and training process is restricted to the same GPU budget for all models.
- [30] arXiv:2406.12947 [pdf, html, other]
-
Title: AutoFirm: Automatically Identifying Reused Libraries inside IoT Firmware at Large-ScaleComments: 13 pages, 20 figuresSubjects: Cryptography and Security (cs.CR); Software Engineering (cs.SE)
The Internet of Things (IoT) has become indispensable to our daily lives and work. Unfortunately, developers often reuse software libraries in the IoT firmware, leading to a major security concern. If vulnerabilities or insecure versions of these libraries go unpatched, a massive number of IoT devices can be impacted. In this paper, we propose the AutoFirm, an automated tool for detecting reused libraries in IoT firmware at a large scale. Specifically, AutoFirm leverages the syntax information (library name and version) to determine whether IoT firmware reuses the libraries. We conduct a large-scale empirical study of reused libraries of IoT firmware, investigating more than 6,900+ firmware and 2,700+ distinct vulnerabilities affecting 11,300+ vulnerable versions from 349 open-source software libraries. Leveraging this diverse information set, we conduct a qualitative assessment of vulnerable library versions to understand security gaps and the misplaced trust of libraries in IoT firmware. Our research reveals that: manufacturers neglected to update outdated libraries for IoT firmware in 67.3\% of cases; on average, outdated libraries persisted for over 1.34 years prior to remediation; vulnerabilities of software libraries have posed server threats to widespread IoT devices.
- [31] arXiv:2406.12948 [pdf, other]
-
Title: New Reservoir Computing Kernel Based on Chaotic Chua Circuit and Investigating Application to Post-Quantum CryptographyComments: 72 pages, 62 figures, Master of Engineering final project reportSubjects: Cryptography and Security (cs.CR); Machine Learning (cs.LG); Chaotic Dynamics (nlin.CD); Applied Physics (physics.app-ph); Classical Physics (physics.class-ph)
The aim of this project was to develop a new Reservoir Computer implementation, based on a chaotic Chua circuit. In addition to suitable classification and regression benchmarks, the Reservoir Computer was applied to Post-Quantum Cryptography, with its suitability for this application investigated and assessed. The cryptographic algorithm utilised was the Learning with Errors problem, for both encryption and decryption. To achieve this, the Chua circuit was characterised, in simulation, and by physical circuit testing. The Reservoir Computer was designed and implemented using the results of the characterisation. As part of this development, noise was considered and mitigated.
The benchmarks demonstrate that the Reservoir Computer can achieve current literature benchmarks with low error. However, the results with Learning with Errors suggest that a Chua-based Reservoir Computer is not sufficiently complex to tackle the high non-linearity in Post-Quantum Cryptography. Future work would involve researching the use of different combinations of multiple Chua Reservoir Computers in larger neural network architectures. Such architectures may produce the required high-dimensional behaviour to achieve the Learning with Errors problem.
This project is believed to be only the second instance of a Chua-based Reservoir Computer in academia, and it is the first to be applied to challenging real-world tasks such as Post-Quantum Cryptography. It is also original by its investigation of hitherto unexplored parameters, and their impact on performance. It demonstrates a proof-of-concept for a mass-producible, inexpensive, low-power consumption hardware neural network. It also enables the next stages in research to occur, paving the road for using Chua-based Reservoir Computers across various applications. - [32] arXiv:2406.12952 [pdf, html, other]
-
Title: Code Agents are State of the Art Software TestersComments: 20 pages, 14 figures, 7 tablesSubjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Rigorous software testing is crucial for developing and maintaining high-quality code, making automated test generation a promising avenue for both improving software quality and boosting the effectiveness of code generation methods. However, while code generation with Large Language Models (LLMs) is an extraordinarily active research area, test generation remains relatively unexplored. We address this gap and investigate the capability of LLM-based Code Agents for formalizing user issues into test cases. To this end, we propose a novel benchmark based on popular GitHub repositories, containing real-world issues, ground-truth patches, and golden tests. We find that LLMs generally perform surprisingly well at generating relevant test cases with Code Agents designed for code repair exceeding the performance of systems designed specifically for test generation. Further, as test generation is a similar but more structured task than code generation, it allows for a more fine-grained analysis using fail-to-pass rate and coverage metrics, providing a dual metric for analyzing systems designed for code repair. Finally, we find that generated tests are an effective filter for proposed code fixes, doubling the precision of SWE-Agent.
- [33] arXiv:2406.12953 [pdf, html, other]
-
Title: Pattern or Artifact? Interactively Exploring Embedding Quality with TRACEEdith Heiter, Liesbet Martens, Ruth Seurinck, Martin Guilliams, Tijl De Bie, Yvan Saeys, Jefrey LijffijtComments: 4 pages, 3 figures, Accepted at ECML-PKDD 2024. For a demo video, see this https URL. Code is available at this https URLSubjects: Graphics (cs.GR); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
This paper presents TRACE, a tool to analyze the quality of 2D embeddings generated through dimensionality reduction techniques. Dimensionality reduction methods often prioritize preserving either local neighborhoods or global distances, but insights from visual structures can be misleading if the objective has not been achieved uniformly. TRACE addresses this challenge by providing a scalable and extensible pipeline for computing both local and global quality measures. The interactive browser-based interface allows users to explore various embeddings while visually assessing the pointwise embedding quality. The interface also facilitates in-depth analysis by highlighting high-dimensional nearest neighbors for any group of points and displaying high-dimensional distances between points. TRACE enables analysts to make informed decisions regarding the most suitable dimensionality reduction method for their specific use case, by showing the degree and location where structure is preserved in the reduced space.
- [34] arXiv:2406.12954 [pdf, html, other]
-
Title: Skin Cancer Images Classification using Transfer Learning TechniquesSubjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Skin cancer is one of the most common and deadliest types of cancer. Early diagnosis of skin cancer at a benign stage is critical to reducing cancer mortality. To detect skin cancer at an earlier stage an automated system is compulsory that can save the life of many patients. Many previous studies have addressed the problem of skin cancer diagnosis using various deep learning and transfer learning models. However, existing literature has limitations in its accuracy and time-consuming procedure. In this work, we applied five different pre-trained transfer learning approaches for binary classification of skin cancer detection at benign and malignant stages. To increase the accuracy of these models we fine-tune different layers and activation functions. We used a publicly available ISIC dataset to evaluate transfer learning approaches. For model stability, data augmentation techniques are applied to improve the randomness of the input dataset. These approaches are evaluated using different hyperparameters such as batch sizes, epochs, and optimizers. The experimental results show that the ResNet-50 model provides an accuracy of 0.935, F1-score of 0.86, and precision of 0.94.
- [35] arXiv:2406.12955 [pdf, other]
-
Title: An End-to-End Coding Scheme for DNA-Based Data Storage With Nanopore-Sequenced ReadsLorenz Welter, Roman Sokolovskii, Thomas Heinis, Antonia Wachter-Zeh, Eirik Rosnes, Alexandre Graell i AmatSubjects: Information Theory (cs.IT)
We consider error-correcting coding for deoxyribonucleic acid (DNA)-based storage using nanopore sequencing. We model the DNA storage channel as a sampling noise channel where the input data is chunked into $M$ short DNA strands, which are copied a random number of times, and the channel outputs a random selection of $N$ noisy DNA strands. The retrieved DNA reads are prone to strand-dependent insertion, deletion, and substitution (IDS) errors. We construct an index-based concatenated coding scheme consisting of the concatenation of an outer code, an index code, and an inner code. We further propose a low-complexity (linear in $N$) maximum a posteriori probability decoder that takes into account the strand-dependent IDS errors and the randomness of the drawing to infer symbolwise a posteriori probabilities for the outer decoder. We present Monte-Carlo simulations for information-outage probabilities and frame error rates for different channel setups on experimental data. We finally evaluate the overall system performance using the read/write cost trade-off. A powerful combination of tailored channel modeling and soft information processing allows us to achieve excellent performance even with error-prone nanopore-sequenced reads outperforming state-of-the-art schemes.%
- [36] arXiv:2406.12975 [pdf, html, other]
-
Title: SHIELD: Evaluation and Defense Strategies for Copyright Compliance in LLM Text GenerationSubjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Large Language Models (LLMs) have transformed machine learning but raised significant legal concerns due to their potential to produce text that infringes on copyrights, resulting in several high-profile lawsuits. The legal landscape is struggling to keep pace with these rapid advancements, with ongoing debates about whether generated text might plagiarize copyrighted materials. Current LLMs may infringe on copyrights or overly restrict non-copyrighted texts, leading to these challenges: (i) the need for a comprehensive evaluation benchmark to assess copyright compliance from multiple aspects; (ii) evaluating robustness against safeguard bypassing attacks; and (iii) developing effective defenses targeted against the generation of copyrighted text. To tackle these challenges, we introduce a curated dataset to evaluate methods, test attack strategies, and propose lightweight, real-time defenses to prevent the generation of copyrighted text, ensuring the safe and lawful use of LLMs. Our experiments demonstrate that current LLMs frequently output copyrighted text, and that jailbreaking attacks can significantly increase the volume of copyrighted output. Our proposed defense mechanisms significantly reduce the volume of copyrighted text generated by LLMs by effectively refusing malicious requests. Code is publicly available at this https URL
- [37] arXiv:2406.12991 [pdf, html, other]
-
Title: Variational multirate integratorsSubjects: Numerical Analysis (math.NA)
The simulation of systems that act on multiple time scales is challenging. A stable integration of the fast dynamics requires a highly accurate approximation whereas for the simulation of the slow part, a coarser approximation is accurate enough. With regard to the general goals of any numerical method, high accuracy and low computational costs, a popular approach is to treat the slow and the fast part of a system differently. Embedding this approach in a variational framework is the keystone of this work. By paralleling continuous and discrete variational multirate dynamics, integrators are derived on a time grid consisting of macro and micro time nodes that are symplectic, momentum preserving and also exhibit good energy behaviour. The choice of the discrete approximations for the action determines the convergence order of the scheme as well as its implicit or explicit nature for the different parts of the multirate system. The convergence order is proven using the theory of variational error analysis. The performance of the multirate variational integrators is demonstrated by means of several examples.
- [38] arXiv:2406.12992 [pdf, html, other]
-
Title: Additive regularization schedule for neural architecture searchSubjects: Machine Learning (cs.LG)
Neural network structures have a critical impact on the accuracy and stability of forecasting. Neural architecture search procedures help design an optimal neural network according to some loss function, which represents a set of quality criteria. This paper investigates the problem of neural network structure optimization. It proposes a way to construct a loss function, which contains a set of additive elements. Each element is called the regularizer. It corresponds to some part of the neural network structure and represents a criterion to optimize. The optimization procedure changes the structure in iterations. To optimize various parts of the structure, the procedure changes the set of regularizers according to some schedule. The authors propose a way to construct the additive regularization schedule. By comparing regularized models with non-regularized ones for a collection of datasets the computational experiments show that the proposed method finds efficient neural network structure and delivers accurate networks of low complexity.
- [39] arXiv:2406.12997 [pdf, html, other]
-
Title: Suitability of CCA for Generating Latent State/ Variables in Multi-View Textual DataSubjects: Computation and Language (cs.CL)
The probabilistic interpretation of Canonical Correlation Analysis (CCA) for learning low-dimensional real vectors, called as latent variables, has been exploited immensely in various fields. This study takes a step further by demonstrating the potential of CCA in discovering a latent state that captures the contextual information within the textual data under a two-view setting. The interpretation of CCA discussed in this study utilizes the multi-view nature of textual data, i.e. the consecutive sentences in a document or turns in a dyadic conversation, and has a strong theoretical foundation. Furthermore, this study proposes a model using CCA to perform the Automatic Short Answer Grading (ASAG) task. The empirical analysis confirms that the proposed model delivers competitive results and can even beat various sophisticated supervised techniques. The model is simple, linear, and adaptable and should be used as the baseline especially when labeled training data is scarce or nonexistent.
- [40] arXiv:2406.13000 [pdf, html, other]
-
Title: Randomized Greedy Online Edge Coloring Succeeds for Dense and Randomly-Ordered GraphsSubjects: Data Structures and Algorithms (cs.DS); Discrete Mathematics (cs.DM); Combinatorics (math.CO)
Vizing's theorem states that any graph of maximum degree $\Delta$ can be properly edge colored with at most $\Delta+1$ colors. In the online setting, it has been a matter of interest to find an algorithm that can properly edge color any graph on $n$ vertices with maximum degree $\Delta = \omega(\log n)$ using at most $(1+o(1))\Delta$ colors. Here we study the naïve random greedy algorithm, which simply chooses a legal color uniformly at random for each edge upon arrival. We show that this algorithm can $(1+\epsilon)\Delta$-color the graph for arbitrary $\epsilon$ in two contexts: first, if the edges arrive in a uniformly random order, and second, if the edges arrive in an adversarial order but the graph is sufficiently dense, i.e., $n = O(\Delta)$. Prior to this work, the random greedy algorithm was only known to succeed in trees.
Our second result is applicable even when the adversary is adaptive, and therefore implies the existence of a deterministic edge coloring algorithm which $(1+\epsilon)\Delta$ edge colors a dense graph. Prior to this, the best known deterministic algorithm for this problem was the simple greedy algorithm which utilized $2\Delta-1$ colors. - [41] arXiv:2406.13002 [pdf, html, other]
-
Title: Recurrence over Video Frames (RoVF) for the Re-identification of MeerkatsMitchell Rogers, Kobe Knowles, Gaël Gendron, Shahrokh Heidari, David Arturo Soriano Valdez, Mihailo Azhar, Padriac O'Leary, Simon Eyre, Michael Witbrock, Patrice DelmasComments: Presented as a poster at the CV4Animals Workshop, CVPR 2024Subjects: Computer Vision and Pattern Recognition (cs.CV)
Deep learning approaches for animal re-identification have had a major impact on conservation, significantly reducing the time required for many downstream tasks, such as well-being monitoring. We propose a method called Recurrence over Video Frames (RoVF), which uses a recurrent head based on the Perceiver architecture to iteratively construct an embedding from a video clip. RoVF is trained using triplet loss based on the co-occurrence of individuals in the video frames, where the individual IDs are unavailable. We tested this method and various models based on the DINOv2 transformer architecture on a dataset of meerkats collected at the Wellington Zoo. Our method achieves a top-1 re-identification accuracy of $49\%$, which is higher than that of the best DINOv2 model ($42\%$). We found that the model can match observations of individuals where humans cannot, and our model (RoVF) performs better than the comparisons with minimal fine-tuning. In future work, we plan to improve these models by using pre-text tasks, apply them to animal behaviour classification, and perform a hyperparameter search to optimise the models further.
- [42] arXiv:2406.13006 [pdf, html, other]
-
Title: Weighted Sum of Segmented Correlation: An Efficient Method for Spectra Matching in Hyperspectral ImagesComments: Accepted in IEEE IGARSS 2024 conferenceSubjects: Computer Vision and Pattern Recognition (cs.CV); Emerging Technologies (cs.ET); Image and Video Processing (eess.IV)
Matching a target spectrum with known spectra in a spectral library is a common method for material identification in hyperspectral imaging research. Hyperspectral spectra exhibit precise absorption features across different wavelength segments, and the unique shapes and positions of these absorptions create distinct spectral signatures for each material, aiding in their identification. Therefore, only the specific positions can be considered for material identification. This study introduces the Weighted Sum of Segmented Correlation method, which calculates correlation indices between various segments of a library and a test spectrum, and derives a matching index, favoring positive correlations and penalizing negative correlations using assigned weights. The effectiveness of this approach is evaluated for mineral identification in hyperspectral images from both Earth and Martian surfaces.
- [43] arXiv:2406.13007 [pdf, html, other]
-
Title: NTIRE 2024 Challenge on Night Photography RenderingEgor Ershov, Artyom Panshin, Oleg Karasev, Sergey Korchagin, Shepelev Lev, Alexandr Startsev, Daniil Vladimirov, Ekaterina Zaychenkova, Nikola Banić, Dmitrii Iarchuk, Maria Efimova, Radu Timofte, Arseniy Terekhin, Shuwei Yue, Yuyang Liu, Minchen Wei, Lu Xu, Chao Zhang, Yasi Wang, Furkan Kınlı, Doğa Yılmaz, Barış Özcan, Furkan Kıraç, Shuai Liu, Jingyuan Xiao, Chaoyu Feng, Hao Wang, Guangqi Shao, Yuqian Zhang, Yibin Huang, Wei Luo, Liming Wang, Xiaotao Wang, Lei Lei, Simone Zini, Claudio Rota, Marco Buzzelli, Simone Bianco, Raimondo Schettini, Jin Guo, Tianli Liu, Mohao Wu, Ben Shao, Qirui Yang, Xianghui Li, Qihua Cheng, Fangpu Zhang, Zhiqiang Xu, Jingyu Yang, Huanjing YueComments: 10 pages, 10 figuresSubjects: Computer Vision and Pattern Recognition (cs.CV)
This paper presents a review of the NTIRE 2024 challenge on night photography rendering. The goal of the challenge was to find solutions that process raw camera images taken in nighttime conditions, and thereby produce a photo-quality output images in the standard RGB (sRGB) space. Unlike the previous year's competition, the challenge images were collected with a mobile phone and the speed of algorithms was also measured alongside the quality of their output. To evaluate the results, a sufficient number of viewers were asked to assess the visual quality of the proposed solutions, considering the subjective nature of the task. There were 2 nominations: quality and efficiency. Top 5 solutions in terms of output quality were sorted by evaluation time (see Fig. 1). The top ranking participants' solutions effectively represent the state-of-the-art in nighttime photography rendering. More results can be found at this https URL.
- [44] arXiv:2406.13008 [pdf, html, other]
-
Title: ClaudesLens: Uncertainty Quantification in Computer Vision ModelsSubjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
In a world where more decisions are made using artificial intelligence, it is of utmost importance to ensure these decisions are well-grounded. Neural networks are the modern building blocks for artificial intelligence. Modern neural network-based computer vision models are often used for object classification tasks. Correctly classifying objects with \textit{certainty} has become of great importance in recent times. However, quantifying the inherent \textit{uncertainty} of the output from neural networks is a challenging task. Here we show a possible method to quantify and evaluate the uncertainty of the output of different computer vision models based on Shannon entropy. By adding perturbation of different levels, on different parts, ranging from the input to the parameters of the network, one introduces entropy to the system. By quantifying and evaluating the perturbed models on the proposed PI and PSI metrics, we can conclude that our theoretical framework can grant insight into the uncertainty of predictions of computer vision models. We believe that this theoretical framework can be applied to different applications for neural networks. We believe that Shannon entropy may eventually have a bigger role in the SOTA (State-of-the-art) methods to quantify uncertainty in artificial intelligence. One day we might be able to apply Shannon entropy to our neural systems.
- [45] arXiv:2406.13009 [pdf, other]
-
Title: Detecting Errors through Ensembling Prompts (DEEP): An End-to-End LLM Framework for Detecting Factual ErrorsSubjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Accurate text summarization is one of the most common and important tasks performed by Large Language Models, where the costs of human review for an entire document may be high, but the costs of errors in summarization may be even greater. We propose Detecting Errors through Ensembling Prompts (DEEP) - an end-to-end large language model framework for detecting factual errors in text summarization. Our framework uses a diverse set of LLM prompts to identify factual inconsistencies, treating their outputs as binary features, which are then fed into ensembling models. We then calibrate the ensembled models to produce empirically accurate probabilities that a text is factually consistent or free of hallucination. We demonstrate that prior models for detecting factual errors in summaries perform significantly worse without optimizing the thresholds on subsets of the evaluated dataset. Our framework achieves state-of-the-art (SOTA) balanced accuracy on the AggreFact-XSUM FTSOTA, TofuEval Summary-Level, and HaluEval Summarization benchmarks in detecting factual errors within transformer-generated text summaries. It does so without any fine-tuning of the language model or reliance on thresholding techniques not available in practical settings.
- [46] arXiv:2406.13012 [pdf, html, other]
-
Title: Data Plagiarism Index: Characterizing the Privacy Risk of Data-Copying in Tabular Generative ModelsSubjects: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Machine Learning (stat.ML)
The promise of tabular generative models is to produce realistic synthetic data that can be shared and safely used without dangerous leakage of information from the training set. In evaluating these models, a variety of methods have been proposed to measure the tendency to copy data from the training dataset when generating a sample. However, these methods suffer from either not considering data-copying from a privacy threat perspective, not being motivated by recent results in the data-copying literature or being difficult to make compatible with the high dimensional, mixed type nature of tabular data. This paper proposes a new similarity metric and Membership Inference Attack called Data Plagiarism Index (DPI) for tabular data. We show that DPI evaluates a new intuitive definition of data-copying and characterizes the corresponding privacy risk. We show that the data-copying identified by DPI poses both privacy and fairness threats to common, high performing architectures; underscoring the necessity for more sophisticated generative modeling techniques to mitigate this issue.
- [47] arXiv:2406.13015 [pdf, other]
-
Title: Deriving Hematological Disease Classes Using Fuzzy Logic and Expert Knowledge: A Comprehensive Machine Learning Approach with CBC ParametersSubjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
In the intricate field of medical diagnostics, capturing the subtle manifestations of diseases remains a challenge. Traditional methods, often binary in nature, may not encapsulate the nuanced variances that exist in real-world clinical scenarios. This paper introduces a novel approach by leveraging Fuzzy Logic Rules to derive disease classes based on expert domain knowledge from a medical practitioner. By recognizing that diseases do not always fit into neat categories, and that expert knowledge can guide the fuzzification of these boundaries, our methodology offers a more sophisticated and nuanced diagnostic tool.
Using a dataset procured from a prominent hospital, containing detailed patient blood count records, we harness Fuzzy Logic Rules, a computational technique celebrated for its ability to handle ambiguity. This approach, moving through stages of fuzzification, rule application, inference, and ultimately defuzzification, produces refined diagnostic predictions. When combined with the Random Forest classifier, the system adeptly predicts hematological conditions using Complete Blood Count (CBC) parameters.
Preliminary results showcase high accuracy levels, underscoring the advantages of integrating fuzzy logic into the diagnostic process. When juxtaposed with traditional diagnostic techniques, it becomes evident that Fuzzy Logic, especially when guided by medical expertise, offers significant advancements in the realm of hematological diagnostics. This paper not only paves the path for enhanced patient care but also beckons a deeper dive into the potentialities of fuzzy logic in various medical diagnostic applications. - [48] arXiv:2406.13017 [pdf, html, other]
-
Title: As Advertised? Understanding the Impact of Influencer VPN AdsComments: Accepted for publication at USENIX Security 2025Subjects: Cryptography and Security (cs.CR); Human-Computer Interaction (cs.HC)
Influencer VPN ads (sponsored segments) on YouTube often disseminate misleading information about both VPNs, and security & privacy more broadly. However, it remains unclear how (or whether) these ads affect users' perceptions and knowledge about VPNs. In this work, we explore the relationship between YouTube VPN ad exposure and users' mental models of VPNs, security, and privacy. We use a novel VPN ad detection model to calculate the ad exposure of 217 participants via their YouTube watch histories, and we develop scales to characterize their mental models in relation to claims commonly made in VPN ads. Through (pre-registered) regression-based analysis, we find that exposure to VPN ads is significantly correlated with familiarity with VPN brands and increased belief in (hyperbolic) threats. While not specific to VPNs, these threats are often discussed in VPN ads. In contrast, although many participants agree with both factual and misleading mental models of VPNs that often appear in ads, we find no significant correlation between exposure to VPN ads and these mental models. These findings suggest that, if VPN ads do impact mental models, then it is predominantly emotional (i.e., threat perceptions) rather than technical.
- [49] arXiv:2406.13018 [pdf, html, other]
-
Title: Reclaiming Power over AI: Equipping Queer Teens as AI Designers for HIV PreventionComments: In CHI 2024: Designing (with) AI for WellbeingSubjects: Human-Computer Interaction (cs.HC)
In this position paper, we explore the potential of generative AI (GenAI) tools in supporting HIV prevention initiatives among LGBTQ+ adolescents. GenAI offers opportunities to bridge information gaps and enhance healthcare access, yet it also risks exacerbating existing inequities through biased AI outputs reflecting heteronormative and cisnormative values. We advocate for the importance of queer adolescent-centered interventions, contend with the promise of GenAI tools while addressing concerns of bias, and position participatory frameworks for empowering queer youth in the design and development of AI tools. Viewing LGBTQ+ adolescents as designers, we propose a community-engaged approach to enable a group of queer teens with sexual health education expertise to design their own GenAI health tools. Through this collaborative effort, we put forward participatory ways to develop processes minimizing the potential iatrogenic harms of biased AI models, while harnessing AI benefits for LGBTQ+ teens. In this workshop, we offer specialized community-engaged knowledge in designing equitable AI tools to improve LGBTQ+ well-being.
- [50] arXiv:2406.13025 [pdf, html, other]
-
Title: ABNet: Attention BarrierNet for Safe and Scalable Robot LearningComments: 18 pagesSubjects: Machine Learning (cs.LG); Robotics (cs.RO); Systems and Control (eess.SY)
Safe learning is central to AI-enabled robots where a single failure may lead to catastrophic results. Barrier-based method is one of the dominant approaches for safe robot learning.
However, this method is not scalable, hard to train, and tends to generate unstable signals under noisy inputs that are challenging to be deployed for robots. To address these challenges, we propose a novel Attention BarrierNet (ABNet) that is scalable to build larger foundational safe models in an incremental manner.
Each head of BarrierNet in the ABNet could learn safe robot control policies from different features and focus on specific part of the observation. In this way, we do not need to one-shotly construct a large model for complex tasks, which significantly facilitates the training of the model while ensuring its stable output. Most importantly, we can still formally prove the safety guarantees of the ABNet. We demonstrate the strength of ABNet in 2D robot obstacle avoidance, safe robot manipulation, and vision-based end-to-end autonomous driving, with results showing much better robustness and guarantees over existing models. - [51] arXiv:2406.13027 [pdf, other]
-
Title: Investigating the Effect of Display Refresh Rate on First-Person Shooting GamesSubjects: Human-Computer Interaction (cs.HC)
For first-person shooting game players, display refresh rate is important for a smooth experience. Multiple studies have shown that a low display refresh rate will reduce gamers' experience and performance. However, the human eye's perception of refresh rate has an upper limit, which is usually less than what high-performance monitors, for which players pay much higher prices, provide. This study assesses whether a higher refresh rate always has a positive impact on players' performance, making it worthwhile for them to invest in high-performance monitors. A within-group experimental design study was conducted using a commercial first-person shooting game platform (N = 26) to investigate players' performance at display refresh rates of 30Hz, 60Hz, 120Hz, 144Hz, and 240Hz. Player performance was assessed based on score, accuracy, and self-ratings from the players. The results show that display refresh rate only significantly affects player performance at 30Hz.
- [52] arXiv:2406.13031 [pdf, html, other]
-
Title: A machine learning pipeline for automated insect monitoringAditya Jain, Fagner Cunha, Michael Bunsen, Léonard Pasi, Anna Viklund, Maxim Larrivée, David RolnickJournal-ref: NeurIPS 2023 Workshop on Tackling Climate Change with Machine LearningSubjects: Computer Vision and Pattern Recognition (cs.CV)
Climate change and other anthropogenic factors have led to a catastrophic decline in insects, endangering both biodiversity and the ecosystem services on which human society depends. Data on insect abundance, however, remains woefully inadequate. Camera traps, conventionally used for monitoring terrestrial vertebrates, are now being modified for insects, especially moths. We describe a complete, open-source machine learning-based software pipeline for automated monitoring of moths via camera traps, including object detection, moth/non-moth classification, fine-grained identification of moth species, and tracking individuals. We believe that our tools, which are already in use across three continents, represent the future of massively scalable data collection in entomology.
- [53] arXiv:2406.13034 [pdf, other]
-
Title: Real-time Yemeni Currency DetectionSubjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Banknote recognition is a major problem faced by visually Challenged people. So we propose a application to help the visually Challenged people to identify the different types of Yemenian currencies through deep learning technique. As money has a significant role in daily life for any business transactions, real-time detection and recognition of banknotes become necessary for a person, especially blind or visually impaired, or for a system that sorts the data. This paper presents a real-time Yemeni currency detection system for visually impaired persons. The proposed system exploits the deep learning approach to facilitate the visually impaired people to prosperously recognize banknotes. For real-time recognition, we have deployed the system into a mobile application.
- [54] arXiv:2406.13035 [pdf, html, other]
-
Title: D2O:Dynamic Discriminative Operations for Efficient Generative Inference of Large Language ModelsZhongwei Wan, Xinjian Wu, Yu Zhang, Yi Xin, Chaofan Tao, Zhihong Zhu, Xin Wang, Siqi Luo, Jing Xiong, Mi ZhangComments: Under reviewSubjects: Computation and Language (cs.CL)
Efficient inference in Large Language Models (LLMs) is impeded by the growing memory demands of key-value (KV) caching, especially for longer sequences. Traditional KV cache eviction strategies, which prioritize less critical KV-pairs based on attention scores, often degrade generation quality, leading to issues such as context loss or hallucinations. To address this, we introduce Dynamic Discriminative Operations (D2O), a novel method that utilizes two-level discriminative strategies to optimize KV cache size without fine-tuning, while preserving essential context. Initially, by observing varying densities of attention weights between shallow and deep layers, we use this insight to determine which layers should avoid excessive eviction to minimize information loss. Subsequently, for the eviction strategy in each layer, D2O innovatively incorporates a compensation mechanism that maintains a similarity threshold to re-discriminate the importance of previously discarded tokens, determining whether they should be recalled and merged with similar tokens. Our approach not only achieves significant memory savings and enhances inference throughput by more than 3x but also maintains high-quality long-text generation. Extensive experiments across various benchmarks and LLM architectures have demonstrated that D2O significantly enhances performance with a constrained KV cache budget.
- [55] arXiv:2406.13038 [pdf, html, other]
-
Title: Traffic Prediction considering Multiple Levels of Spatial-temporal Information: A Multi-scale Graph Wavelet-based ApproachSubjects: Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
Although traffic prediction has been receiving considerable attention with a number of successes in the context of intelligent transportation systems, the prediction of traffic states over a complex transportation network that contains different road types has remained a challenge. This study proposes a multi-scale graph wavelet temporal convolution network (MSGWTCN) to predict the traffic states in complex transportation networks. Specifically, a multi-scale spatial block is designed to simultaneously capture the spatial information at different levels, and the gated temporal convolution network is employed to extract the temporal dependencies of the data. The model jointly learns to mount multiple levels of the spatial interactions by stacking graph wavelets with different scales. Two real-world datasets are used in this study to investigate the model performance, including a highway network in Seattle and a dense road network of Manhattan in New York City. Experiment results show that the proposed model outperforms other baseline models. Furthermore, different scales of graph wavelets are found to be effective in extracting local, intermediate and global information at the same time and thus enable the model to learn a complex transportation network topology with various types of road segments. By carefully customizing the scales of wavelets, the model is able to improve the prediction performance and better adapt to different network configurations.
- [56] arXiv:2406.13039 [pdf, html, other]
-
Title: Modeling and Controls of Fluid-Structure Interactions (FSI) in Dynamic Morphing FlightSubjects: Robotics (cs.RO)
The primary aim of this study is to enhance the accuracy of our aerodynamic Fluid-Structure Interaction (FSI) model to support the controlled tracking of 3D flight trajectories by Aerobat, which is a dynamic morphing winged drone. Building upon our previously documented Unsteady Aerodynamic model rooted in horseshoe vortices, we introduce a new iteration of Aerobat, labeled as version beta, which is designed for attachment to a Kinova arm. Through a series of experiments, we gather force-moment data from the robotic arm attachment and utilize it to fine-tune our unsteady model for banking turn maneuvers. Subsequently, we employ the tuned FSI model alongside a collocation control strategy to accomplish 3D banking turns of Aerobat within simulation environments. The primary contribution lies in presenting a methodical approach to calibrate our FSI model to predict complex 3D maneuvers and successfully assessing the model's potential for closed-loop flight control of Aerobat using an optimization-based collocation method.
- [57] arXiv:2406.13041 [pdf, html, other]
-
Title: Accelerated Stochastic Min-Max Optimization Based on Bias-corrected MomentumSubjects: Machine Learning (cs.LG); Optimization and Control (math.OC)
Lower-bound analyses for nonconvex strongly-concave minimax optimization problems have shown that stochastic first-order algorithms require at least $\mathcal{O}(\varepsilon^{-4})$ oracle complexity to find an $\varepsilon$-stationary point. Some works indicate that this complexity can be improved to $\mathcal{O}(\varepsilon^{-3})$ when the loss gradient is Lipschitz continuous. The question of achieving enhanced convergence rates under distinct conditions, remains unresolved. In this work, we address this question for optimization problems that are nonconvex in the minimization variable and strongly concave or Polyak-Lojasiewicz (PL) in the maximization variable. We introduce novel bias-corrected momentum algorithms utilizing efficient Hessian-vector products. We establish convergence conditions and demonstrate a lower iteration complexity of $\mathcal{O}(\varepsilon^{-3})$ for the proposed algorithms. The effectiveness of the method is validated through applications to robust logistic regression using real-world datasets.
- [58] arXiv:2406.13045 [pdf, html, other]
-
Title: Artifact Evaluation for Distributed Systems: Current Practices and BeyondComments: 13 pages, 2 figures, 3 tablesSubjects: Distributed, Parallel, and Cluster Computing (cs.DC)
Although repeatability and reproducibility are essential in science, failed attempts to replicate results across diverse fields made some scientists argue for a reproducibility crisis. In response, several high-profile venues within computing established artifact evaluation tracks, a systematic procedure for evaluating and badging research artifacts, with an increasing number of artifacts submitted. This study compiles recent artifact evaluation procedures and guidelines to show how artifact evaluation in distributed systems research lags behind other computing disciplines and/or is less unified and more complex. We further argue that current artifact assessment criteria are uncoordinated and insufficient for the unique challenges of distributed systems research. We examine the current state of the practice for artifacts and their evaluation to provide recommendations to assist artifact authors, reviewers, and track chairs. We summarize the recommendations and best practices as checklists for artifact authors and evaluation committees. Although our recommendations alone will not resolve the repeatability and reproducibility crisis, we want to start a discussion in our community to increase the number of submitted artifacts and their quality over time.
- [59] arXiv:2406.13046 [pdf, html, other]
-
Title: Bayesian-LoRA: LoRA based Parameter Efficient Fine-Tuning using Optimal Quantization levels and Rank Values trough Differentiable Bayesian GatesSubjects: Artificial Intelligence (cs.AI)
It is a common practice in natural language processing to pre-train a single model on a general domain and then fine-tune it for downstream tasks. However, when it comes to Large Language Models, fine-tuning the entire model can be computationally expensive, resulting in very intensive energy consumption. As a result, several Parameter efficient fine-tuning (PEFT) approaches were recently proposed. One of the most popular approaches is low-rank adaptation (LoRA), where the key insight is decomposing the update weights of the pre-trained model into two low-rank matrices. However, the proposed approaches either use the same rank value across all different weight matrices or do not use any quantization technique, which has been shown to be one of the most important factors when it comes to a model's energy consumption. In this work, we propose Bayesian-LoRA (B-LoRA) which approaches matrix decomposition and quantization from a Bayesian perspective by employing a prior distribution on both quantization levels and rank values of the learned low-rank matrices. As a result, B-LoRA is able to fine-tune a pre-trained model on a specific downstream task, finding the optimal rank values and quantization levels for every low-rank matrix. We validate the proposed model fine-tuning a pre-trained DeBERTaV3 on the GLUE benchmark. Moreover, we compare it to relevant baselines and present both qualitative and quantitative results, showing how the proposed approach is able to learn optimal-rank quantized matrices. B-LoRA performs on par or better than baselines while reducing the total amount of bit operations of roughly 70% with respect to the baselines ones.
- [60] arXiv:2406.13048 [pdf, html, other]
-
Title: Head Pose Estimation and 3D Neural Surface Reconstruction via Monocular Camera in situ for Navigation and Safe Insertion into Natural OpeningsComments: Accepted by ICBIR 2024Subjects: Computer Vision and Pattern Recognition (cs.CV)
As the significance of simulation in medical care and intervention continues to grow, it is anticipated that a simplified and low-cost platform can be set up to execute personalized diagnoses and treatments. 3D Slicer can not only perform medical image analysis and visualization but can also provide surgical navigation and surgical planning functions. In this paper, we have chosen 3D Slicer as our base platform and monocular cameras are used as sensors. Then, We used the neural radiance fields (NeRF) algorithm to complete the 3D model reconstruction of the human head. We compared the accuracy of the NeRF algorithm in generating 3D human head scenes and utilized the MarchingCube algorithm to generate corresponding 3D mesh models. The individual's head pose, obtained through single-camera vision, is transmitted in real-time to the scene created within 3D Slicer. The demonstrations presented in this paper include real-time synchronization of transformations between the human head model in the 3D Slicer scene and the detected head posture. Additionally, we tested a scene where a tool, marked with an ArUco Maker tracked by a single camera, synchronously points to the real-time transformation of the head posture. These demos indicate that our methodology can provide a feasible real-time simulation platform for nasopharyngeal swab collection or intubation.
- [61] arXiv:2406.13049 [pdf, html, other]
-
Title: Assessing AI vs Human-Authored Spear Phishing SMS Attacks: An Empirical Study Using the TRAPD MethodComments: 18 pages, 5 figures, 1 tableSubjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
This paper explores the rising concern of utilizing Large Language Models (LLMs) in spear phishing message generation, and their performance compared to human-authored counterparts. Our pilot study compares the effectiveness of smishing (SMS phishing) messages created by GPT-4 and human authors, which have been personalized to willing targets. The targets assessed the messages in a modified ranked-order experiment using a novel methodology we call TRAPD (Threshold Ranking Approach for Personalized Deception). Specifically, targets provide personal information (job title and location, hobby, item purchased online), spear smishing messages are created using this information by humans and GPT-4, targets are invited back to rank-order 12 messages from most to least convincing (and identify which they would click on), and then asked questions about why they ranked messages the way they did. They also guess which messages are created by an LLM and their reasoning. Results from 25 targets show that LLM-generated messages are most often perceived as more convincing than those authored by humans, with messages related to jobs being the most convincing. We characterize different criteria used when assessing the authenticity of messages including word choice, style, and personal relevance. Results also show that targets were unable to identify whether the messages was AI-generated or human-authored and struggled to identify criteria to use in order to make this distinction. This study aims to highlight the urgent need for further research and improved countermeasures against personalized AI-enabled social engineering attacks.
- [62] arXiv:2406.13050 [pdf, html, other]
-
Title: Think-then-Act: A Dual-Angle Evaluated Retrieval-Augmented GenerationComments: 12 pages, 8 figuresSubjects: Computation and Language (cs.CL)
Despite their impressive capabilities, large language models (LLMs) often face challenges such as temporal misalignment and generating hallucinatory content. Enhancing LLMs with retrieval mechanisms to fetch relevant information from external sources offers a promising solution. Inspired by the proverb "Think twice before you act," we propose a dual-angle evaluated retrieval-augmented generation framework \textit{Think-then-Act}. Unlike previous approaches that indiscriminately rewrite queries or perform retrieval regardless of necessity, or generate temporary responses before deciding on additional retrieval, which increases model generation costs, our framework employs a two-phase process: (i) assessing the input query for clarity and completeness to determine if rewriting is necessary; and (ii) evaluating the model's capability to answer the query and deciding if additional retrieval is needed. Experimental results on five datasets show that the \textit{Think-then-Act} framework significantly improves performance. Our framework demonstrates notable improvements in accuracy and efficiency compared to existing baselines and performs well in both English and non-English contexts. Ablation studies validate the optimal model confidence threshold, highlighting the resource optimization benefits of our approach.
- [63] arXiv:2406.13057 [pdf, html, other]
-
Title: Informed along the road: roadway capacity driven graph convolution network for network-wide traffic predictionSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
While deep learning has shown success in predicting traffic states, most methods treat it as a general prediction task without considering transportation aspects. Recently, graph neural networks have proven effective for this task, but few incorporate external factors that impact roadway capacity and traffic flow. This study introduces the Roadway Capacity Driven Graph Convolution Network (RCDGCN) model, which incorporates static and dynamic roadway capacity attributes in spatio-temporal settings to predict network-wide traffic states. The model was evaluated on two real-world datasets with different transportation factors: the ICM-495 highway network and an urban network in Manhattan, New York City. Results show RCDGCN outperformed baseline methods in forecasting accuracy. Analyses, including ablation experiments, weight analysis, and case studies, investigated the effect of capacity-related factors. The study demonstrates the potential of using RCDGCN for transportation system management.
- [64] arXiv:2406.13060 [pdf, html, other]
-
Title: Scale-Translation Equivariant Network for Oceanic Internal Solitary Wave LocalizationComments: 29 pages, 5 figuresSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Applications (stat.AP)
Internal solitary waves (ISWs) are gravity waves that are often observed in the interior ocean rather than the surface. They hold significant importance due to their capacity to carry substantial energy, thus influence pollutant transport, oil platform operations, submarine navigation, etc. Researchers have studied ISWs through optical images, synthetic aperture radar (SAR) images, and altimeter data from remote sensing instruments. However, cloud cover in optical remote sensing images variably obscures ground information, leading to blurred or missing surface observations. As such, this paper aims at altimeter-based machine learning solutions to automatically locate ISWs. The challenges, however, lie in the following two aspects: 1) the altimeter data has low resolution, which requires a strong machine learner; 2) labeling data is extremely labor-intensive, leading to very limited data for training. In recent years, the grand progress of deep learning demonstrates strong learning capacity given abundant data. Besides, more recent studies on efficient learning and self-supervised learning laid solid foundations to tackle the aforementioned challenges. In this paper, we propose to inject prior knowledge to achieve a strong and efficient learner. Specifically, intrinsic patterns in altimetry data are efficiently captured using a scale-translation equivariant convolutional neural network (ST-ECNN). By considering inherent symmetries in neural network design, ST-ECNN achieves higher efficiency and better performance than baseline models. Furthermore, we also introduce prior knowledge from massive unsupervised data to enhance our solution using the SimCLR framework for pre-training. Our final solution achieves an overall better performance than baselines on our handcrafted altimetry dataset. Data and codes are available at this https URL .
- [65] arXiv:2406.13062 [pdf, html, other]
-
Title: Transforming Property GraphsComments: To appear in VLDB 2024Subjects: Databases (cs.DB)
In this paper, we study a declarative framework for specifying transformations of property graphs. In order to express such transformations, we leverage queries formulated in the Graph Pattern Calculus (GPC), which is an abstraction of the common core of recent standard graph query languages, GQL and SQL/PGQ. In contrast to previous frameworks targeting graph topology only, we focus on the impact of data values on the transformations--which is crucial in addressing users needs. In particular, we study the complexity of checking if the transformation rules do not specify conflicting values for properties, and we show this is closely related to the satisfiability problem for GPC. We prove that both problems are PSpace-complete. We have implemented our framework in openCypher. We show the flexibility and usability of our framework by leveraging an existing data integration benchmark, adapting it to our needs. We also evaluate the incurred overhead of detecting potential inconsistencies at run-time, and the impact of several optimization tools in a Cypher-based graph database, by providing a comprehensive comparison of different implementation variants. The results of our experimental study show that our framework exhibits large practical benefits for transforming property graphs compared to ad-hoc transformation scripts.
- [66] arXiv:2406.13064 [pdf, other]
-
Title: Machine Learning and Optimization Techniques for Solving Inverse Kinematics in a 7-DOF Robotic ArmSubjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
As the pace of AI technology continues to accelerate, more tools have become available to researchers to solve longstanding problems, Hybrid approaches available today continue to push the computational limits of efficiency and precision. One of such problems is the inverse kinematics of redundant systems. This paper explores the complexities of a 7 degree of freedom manipulator and explores 13 optimization techniques to solve it. Additionally, a novel approach is proposed to contribute to the field of algorithmic research. This was found to be over 200 times faster than the well-known traditional Particle Swarm Optimization technique. This new method may serve as a new field of search that combines the explorative capabilities of Machine Learning with the exploitative capabilities of numerical methods.
- [67] arXiv:2406.13066 [pdf, html, other]
-
Title: MaskPure: Improving Defense Against Text Adversaries with Stochastic PurificationComments: 15 pages, 1 figure, in the proceedings of The 29th International Conference on Natural Language & Information Systems (NLDB 2024)Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
The improvement of language model robustness, including successful defense against adversarial attacks, remains an open problem. In computer vision settings, the stochastic noising and de-noising process provided by diffusion models has proven useful for purifying input images, thus improving model robustness against adversarial attacks. Similarly, some initial work has explored the use of random noising and de-noising to mitigate adversarial attacks in an NLP setting, but improving the quality and efficiency of these methods is necessary for them to remain competitive. We extend upon methods of input text purification that are inspired by diffusion processes, which randomly mask and refill portions of the input text before classification. Our novel method, MaskPure, exceeds or matches robustness compared to other contemporary defenses, while also requiring no adversarial classifier training and without assuming knowledge of the attack type. In addition, we show that MaskPure is provably certifiably robust. To our knowledge, MaskPure is the first stochastic-purification method with demonstrated success against both character-level and word-level attacks, indicating the generalizable and promising nature of stochastic denoising defenses. In summary: the MaskPure algorithm bridges literature on the current strongest certifiable and empirical adversarial defense methods, showing that both theoretical and practical robustness can be obtained together. Code is available on GitHub at this https URL.
- [68] arXiv:2406.13069 [pdf, html, other]
-
Title: Evaluating $n$-Gram Novelty of Language Models Using Rusty-DAWGComments: 8 page preprint + appendixSubjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
How novel are texts generated by language models (LMs) relative to their training corpora? In this work, we investigate the extent to which modern LMs generate $n$-grams from their training data, evaluating both (i) the probability LMs assign to complete training $n$-grams and (ii) $n$-novelty, the proportion of $n$-grams generated by an LM that did not appear in the training data (for arbitrarily large $n$). To enable arbitrary-length $n$-gram search over a corpus in constant time, we develop Rusty-DAWG, a novel search tool inspired by indexing of genomic data. We compare the novelty of LM-generated text to human-written text and explore factors that affect generation novelty, focusing on the Pythia models. We find that, for $n > 4$, LM-generated text is less novel than human-written text, though it is more novel for smaller $n$. Larger LMs and more constrained decoding strategies both decrease novelty. Finally, we show that LMs complete $n$-grams with lower loss if they are less frequent in the training data. Overall, our results reveal factors influencing the novelty of LM-generated text, and we release Rusty-DAWG to facilitate further pretraining data research.
- [69] arXiv:2406.13073 [pdf, html, other]
-
Title: NoiSec: Harnessing Noise for Security against Adversarial and Backdoor AttacksComments: 20 pages, 7 figuresSubjects: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
The exponential adoption of machine learning (ML) is propelling the world into a future of intelligent automation and data-driven solutions. However, the proliferation of malicious data manipulation attacks against ML, namely adversarial and backdoor attacks, jeopardizes its reliability in safety-critical applications. The existing detection methods against such attacks are built upon assumptions, limiting them in diverse practical scenarios. Thus, motivated by the need for a more robust and unified defense mechanism, we investigate the shared traits of adversarial and backdoor attacks and propose NoiSec that leverages solely the noise, the foundational root cause of such attacks, to detect any malicious data alterations. NoiSec is a reconstruction-based detector that disentangles the noise from the test input, extracts the underlying features from the noise, and leverages them to recognize systematic malicious manipulation. Experimental evaluations conducted on the CIFAR10 dataset demonstrate the efficacy of NoiSec, achieving AUROC scores exceeding 0.954 and 0.852 under white-box and black-box adversarial attacks, respectively, and 0.992 against backdoor attacks. Notably, NoiSec maintains a high detection performance, keeping the false positive rate within only 1\%. Comparative analyses against MagNet-based baselines reveal NoiSec's superior performance across various attack scenarios.
- [70] arXiv:2406.13075 [pdf, other]
-
Title: Exact Community Recovery (under Side Information): Optimality of Spectral AlgorithmsComments: 70 pages, 3 FiguresSubjects: Social and Information Networks (cs.SI); Machine Learning (cs.LG)
In this paper, we study the problem of exact community recovery in general, two-community block models considering both Bernoulli and Gaussian matrix models, capturing the Stochastic Block Model, submatrix localization, and $\mathbb{Z}_2$-synchronization as special cases. We also study the settings where $side$ $information$ about community assignment labels is available, modeled as passing the true labels through a noisy channel: either the binary erasure channel (where some community labels are known while others are erased) or the binary symmetric channel (where some labels are flipped). We provide a unified analysis of the effect of side information on the information-theoretic limits of exact recovery, generalizing prior works and extending to new settings. Additionally, we design a simple but optimal spectral algorithm that incorporates side information (when present) along with the eigenvectors of the matrix observation. Using the powerful tool of entrywise eigenvector analysis [Abbe, Fan, Wang, Zhong 2020], we show that our spectral algorithm can mimic the so called $genie$-$aided$ $estimators$, where the $i^{\mathrm{th}}$ genie-aided estimator optimally computes the estimate of the $i^{\mathrm{th}}$ label, when all remaining labels are revealed by a genie. This perspective provides a unified understanding of the optimality of spectral algorithms for various exact recovery problems in a recent line of work.
- [71] arXiv:2406.13080 [pdf, html, other]
-
Title: An Experimental Characterization of Combined RowHammer and RowPress Read Disturbance in Modern DRAM ChipsSubjects: Hardware Architecture (cs.AR); Cryptography and Security (cs.CR)
DRAM read disturbance can break memory isolation, a fundamental property to ensure system robustness (i.e., reliability, security, safety). RowHammer and RowPress are two different DRAM read disturbance phenomena. RowHammer induces bitflips in physically adjacent victim DRAM rows by repeatedly opening and closing an aggressor DRAM row, while RowPress induces bitflips by keeping an aggressor DRAM row open for a long period of time. In this study, we characterize a DRAM access pattern that combines RowHammer and RowPress in 84 real DDR4 DRAM chips from all three major DRAM manufacturers. Our key results show that 1) this combined RowHammer and RowPress pattern takes significantly smaller amount of time (up to 46.1% faster) to induce the first bitflip compared to the state-of-the-art RowPress pattern, and 2) at the minimum aggressor row activation count to induce at least one bitflip, the bits that flip are different across RowHammer, RowPress, and the combined patterns. Based on our results, we provide a key hypothesis that the read disturbance effect caused by RowPress from one of the two aggressor rows in a double-sided pattern is much more significant than the other.
- [72] arXiv:2406.13081 [pdf, html, other]
-
Title: Class-specific Data Augmentation for Plant Stress ClassificationNasla Saleem, Aditya Balu, Talukder Zaki Jubery, Arti Singh, Asheesh K. Singh, Soumik Sarkar, Baskar GanapathysubramanianSubjects: Computer Vision and Pattern Recognition (cs.CV)
Data augmentation is a powerful tool for improving deep learning-based image classifiers for plant stress identification and classification. However, selecting an effective set of augmentations from a large pool of candidates remains a key challenge, particularly in imbalanced and confounding datasets. We propose an approach for automated class-specific data augmentation using a genetic algorithm. We demonstrate the utility of our approach on soybean [Glycine max (L.) Merr] stress classification where symptoms are observed on leaves; a particularly challenging problem due to confounding classes in the dataset. Our approach yields substantial performance, achieving a mean-per-class accuracy of 97.61% and an overall accuracy of 98% on the soybean leaf stress dataset. Our method significantly improves the accuracy of the most challenging classes, with notable enhancements from 83.01% to 88.89% and from 85.71% to 94.05%, respectively.
A key observation we make in this study is that high-performing augmentation strategies can be identified in a computationally efficient manner. We fine-tune only the linear layer of the baseline model with different augmentations, thereby reducing the computational burden associated with training classifiers from scratch for each augmentation policy while achieving exceptional performance. This research represents an advancement in automated data augmentation strategies for plant stress classification, particularly in the context of confounding datasets. Our findings contribute to the growing body of research in tailored augmentation techniques and their potential impact on disease management strategies, crop yields, and global food security. The proposed approach holds the potential to enhance the accuracy and efficiency of deep learning-based tools for managing plant stresses in agriculture. - [73] arXiv:2406.13086 [pdf, html, other]
-
Title: NaviSplit: Dynamic Multi-Branch Split DNNs for Efficient Distributed Autonomous NavigationJournal-ref: WoWMoM 2024Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Lightweight autonomous unmanned aerial vehicles (UAV) are emerging as a central component of a broad range of applications. However, autonomous navigation necessitates the implementation of perception algorithms, often deep neural networks (DNN), that process the input of sensor observations, such as that from cameras and LiDARs, for control logic. The complexity of such algorithms clashes with the severe constraints of these devices in terms of computing power, energy, memory, and execution time. In this paper, we propose NaviSplit, the first instance of a lightweight navigation framework embedding a distributed and dynamic multi-branched neural model. At its core is a DNN split at a compression point, resulting in two model parts: (1) the head model, that is executed at the vehicle, which partially processes and compacts perception from sensors; and (2) the tail model, that is executed at an interconnected compute-capable device, which processes the remainder of the compacted perception and infers navigation commands. Different from prior work, the NaviSplit framework includes a neural gate that dynamically selects a specific head model to minimize channel usage while efficiently supporting the navigation network. In our implementation, the perception model extracts a 2D depth map from a monocular RGB image captured by the drone using the robust simulator Microsoft AirSim. Our results demonstrate that the NaviSplit depth model achieves an extraction accuracy of 72-81% while transmitting an extremely small amount of data (1.2-18 KB) to the edge server. When using the neural gate, as utilized by NaviSplit, we obtain a slightly higher navigation accuracy as compared to a larger static network by 0.3% while significantly reducing the data rate by 95%. To the best of our knowledge, this is the first exemplar of dynamic multi-branched model based on split DNNs for autonomous navigation.
- [74] arXiv:2406.13092 [pdf, html, other]
-
Title: Multilingual Synopses of Movie Narratives: A Dataset for Story UnderstandingComments: 16 pages, 9 figuresSubjects: Computation and Language (cs.CL)
Story video-text alignment, a core task in computational story understanding, aims to align video clips with corresponding sentences in their descriptions. However, progress on the task has been held back by the scarcity of manually annotated video-text correspondence and the heavy concentration on English narrations of Hollywood movies. To address these issues, in this paper, we construct a large-scale multilingual video story dataset named Multilingual Synopses of Movie Narratives (M-SYMON), containing 13,166 movie summary videos from 7 languages, as well as manual annotation of fine-grained video-text correspondences for 101.5 hours of video. Training on the human annotated data from SyMoN outperforms the SOTA methods by 15.7 and 16.2 percentage points on Clip Accuracy and Sentence IoU scores, respectively, demonstrating the effectiveness of the annotations. As benchmarks for future research, we create 6 baseline approaches with different multilingual training strategies, compare their performance in both intra-lingual and cross-lingual setups, exemplifying the challenges of multilingual video-text alignment.
- [75] arXiv:2406.13093 [pdf, html, other]
-
Title: RITA: A Real-time Interactive Talking Avatars FrameworkSubjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
RITA presents a high-quality real-time interactive framework built upon generative models, designed with practical applications in mind. Our framework enables the transformation of user-uploaded photos into digital avatars that can engage in real-time dialogue interactions. By leveraging the latest advancements in generative modeling, we have developed a versatile platform that not only enhances the user experience through dynamic conversational avatars but also opens new avenues for applications in virtual reality, online education, and interactive gaming. This work showcases the potential of integrating computer vision and natural language processing technologies to create immersive and interactive digital personas, pushing the boundaries of how we interact with digital content.
- [76] arXiv:2406.13094 [pdf, html, other]
-
Title: Exploring and Benchmarking the Planning Capabilities of Large Language ModelsBernd Bohnet, Azade Nova, Aaron T Parisi, Kevin Swersky, Katayoon Goshvadi, Hanjun Dai, Dale Schuurmans, Noah Fiedel, Hanie SedghiSubjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
We seek to elevate the planning capabilities of Large Language Models (LLMs)investigating four main directions. First, we construct a comprehensive benchmark suite encompassing both classical planning domains and natural language scenarios. This suite includes algorithms to generate instances with varying levels of difficulty, allowing for rigorous and systematic evaluation of LLM performance. Second, we investigate the use of in-context learning (ICL) to enhance LLM planning, exploring the direct relationship between increased context length and improved planning performance. Third, we demonstrate the positive impact of fine-tuning LLMs on optimal planning paths, as well as the effectiveness of incorporating model-driven search procedures. Finally, we investigate the performance of the proposed methods in out-of-distribution scenarios, assessing the ability to generalize to novel and unseen planning challenges.
- [77] arXiv:2406.13098 [pdf, other]
-
Title: DLP: towards active defense against backdoor attacks with decoupled learning processSubjects: Cryptography and Security (cs.CR)
Deep learning models are well known to be susceptible to backdoor attack, where the attacker only needs to provide a tampered dataset on which the triggers are injected. Models trained on the dataset will passively implant the backdoor, and triggers on the input can mislead the models during testing. Our study shows that the model shows different learning behaviors in clean and poisoned subsets during training. Based on this observation, we propose a general training pipeline to defend against backdoor attacks actively. Benign models can be trained from the unreliable dataset by decoupling the learning process into three stages, i.e., supervised learning, active unlearning, and active semi-supervised fine-tuning. The effectiveness of our approach has been shown in numerous experiments across various backdoor attacks and datasets.
- [78] arXiv:2406.13099 [pdf, html, other]
-
Title: Sampling 3D Gaussian Scenes in Seconds with Latent Diffusion ModelsSubjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
We present a latent diffusion model over 3D scenes, that can be trained using only 2D image data. To achieve this, we first design an autoencoder that maps multi-view images to 3D Gaussian splats, and simultaneously builds a compressed latent representation of these splats. Then, we train a multi-view diffusion model over the latent space to learn an efficient generative model. This pipeline does not require object masks nor depths, and is suitable for complex scenes with arbitrary camera positions. We conduct careful experiments on two large-scale datasets of complex real-world scenes -- MVImgNet and RealEstate10K. We show that our approach enables generating 3D scenes in as little as 0.2 seconds, either from scratch, from a single input view, or from sparse input views. It produces diverse and high-quality results while running an order of magnitude faster than non-latent diffusion models and earlier NeRF-based generative models
- [79] arXiv:2406.13101 [pdf, html, other]
-
Title: On instabilities in neural network-based physics simulatorsComments: 15 pagesSubjects: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE); Chaotic Dynamics (nlin.CD)
When neural networks are trained from data to simulate the dynamics of physical systems, they encounter a persistent challenge: the long-time dynamics they produce are often unphysical or unstable. We analyze the origin of such instabilities when learning linear dynamical systems, focusing on the training dynamics. We make several analytical findings which empirical observations suggest extend to nonlinear dynamical systems. First, the rate of convergence of the training dynamics is uneven and depends on the distribution of energy in the data. As a special case, the dynamics in directions where the data have no energy cannot be learned. Second, in the unlearnable directions, the dynamics produced by the neural network depend on the weight initialization, and common weight initialization schemes can produce unstable dynamics. Third, injecting synthetic noise into the data during training adds damping to the training dynamics and can stabilize the learned simulator, though doing so undesirably biases the learned dynamics. For each contributor to instability, we suggest mitigative strategies. We also highlight important differences between learning discrete-time and continuous-time dynamics, and discuss extensions to nonlinear systems.
- [80] arXiv:2406.13103 [pdf, html, other]
-
Title: A Generic Method for Fine-grained Category Discovery in Natural Language TextsComments: preprintSubjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Fine-grained category discovery using only coarse-grained supervision is a cost-effective yet challenging task. Previous training methods focus on aligning query samples with positive samples and distancing them from negatives. They often neglect intra-category and inter-category semantic similarities of fine-grained categories when navigating sample distributions in the embedding space. Furthermore, some evaluation techniques that rely on pre-collected test samples are inadequate for real-time applications. To address these shortcomings, we introduce a method that successfully detects fine-grained clusters of semantically similar texts guided by a novel objective function. The method uses semantic similarities in a logarithmic space to guide sample distributions in the Euclidean space and to form distinct clusters that represent fine-grained categories. We also propose a centroid inference mechanism to support real-time applications. The efficacy of the method is both theoretically justified and empirically confirmed on three benchmark tasks. The proposed objective function is integrated in multiple contrastive learning based neural models. Its results surpass existing state-of-the-art approaches in terms of Accuracy, Adjusted Rand Index and Normalized Mutual Information of the detected fine-grained categories. Code and data will be available at this https URL upon publication.
- [81] arXiv:2406.13105 [pdf, html, other]
-
Title: A transformer boosted UNet for smoke segmentation in complex backgrounds in multispectral LandSat imagerySubjects: Computer Vision and Pattern Recognition (cs.CV)
Many studies have been done to detect smokes from satellite imagery. However, these prior methods are not still effective in detecting various smokes in complex backgrounds. Smokes present challenges in detection due to variations in density, color, lighting, and backgrounds such as clouds, haze, and/or mist, as well as the contextual nature of thin smoke. This paper addresses these challenges by proposing a new segmentation model called VTrUNet which consists of a virtual band construction module to capture spectral patterns and a transformer boosted UNet to capture long range contextual features. The model takes imagery of six bands: red, green, blue, near infrared, and two shortwave infrared bands as input. To show the advantages of the proposed model, the paper presents extensive results for various possible model architectures improving UNet and draws interesting conclusions including that adding more modules to a model does not always lead to a better performance. The paper also compares the proposed model with very recently proposed and related models for smoke segmentation and shows that the proposed model performs the best and makes significant improvements on prediction performances
- [82] arXiv:2406.13106 [pdf, html, other]
-
Title: Accelerating Complex Disease Treatment through Network Medicine and GenAI: A Case Study on Drug Repurposing for Breast CancerComments: 9 pages double columns, 5 figures, 3 algorithms, 3 tables, and 1 listing, Submitted to IEEE MedAI'24 Conference, to be held November 15-17, Chongqing, ChinaSubjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)
The objective of this research is to introduce a network specialized in predicting drugs that can be repurposed by investigating real-world evidence sources, such as clinical trials and biomedical literature. Specifically, it aims to generate drug combination therapies for complex diseases (e.g., cancer, Alzheimer's). We present a multilayered network medicine approach, empowered by a highly configured ChatGPT prompt engineering system, which is constructed on the fly to extract drug mentions in clinical trials. Additionally, we introduce a novel algorithm that connects real-world evidence with disease-specific signaling pathways (e.g., KEGG database). This sheds light on the repurposability of drugs if they are found to bind with one or more protein constituents of a signaling pathway. To demonstrate, we instantiated the framework for breast cancer and found that, out of 46 breast cancer signaling pathways, the framework identified 38 pathways that were covered by at least two drugs. This evidence signals the potential for combining those drugs. Specifically, the most covered signaling pathway, ID hsa:2064, was covered by 108 drugs, some of which can be combined. Conversely, the signaling pathway ID hsa:1499 was covered by only two drugs, indicating a significant gap for further research. Our network medicine framework, empowered by GenAI, shows promise in identifying drug combinations with a high degree of specificity, knowing the exact signaling pathways and proteins that serve as targets. It is noteworthy that ChatGPT successfully accelerated the process of identifying drug mentions in clinical trials, though further investigations are required to determine the relationships among the drug mentions.
- [83] arXiv:2406.13107 [pdf, html, other]
-
Title: Blitzcrank: Fast Semantic Compression for In-memory Online Transaction ProcessingComments: 18 pages, 19 figures, to be published in VLDB'24Subjects: Databases (cs.DB)
We present BLITZCRANK, a high-speed semantic compressor designed for OLTP databases. Previous solutions are inadequate for compressing row-stores: they suffer from either low compression factor due to a coarse compression granularity or suboptimal performance due to the inefficiency in handling dynamic data sets. To solve these problems, we first propose novel semantic models that support fast inferences and dynamic value set for both discrete and continuous data types. We then introduce a new entropy encoding algorithm, called delayed coding, that achieves significant improvement in the decoding speed compared to modern arithmetic coding implementations. We evaluate BLITZCRANK in both standalone microbenchmarks and a multicore in-memory row-store using the TCPC-C benchmark. Our results show that BLITZCRANK achieves a sub-microsecond latency for decompressing a random tuple while obtaining high compression factors. This leads to an 85% memory reduction in the TPC-C evaluation with a moderate (19%) throughput degradation. For data sets larger than the available physical memory, BLITZCRANK help the database sustain a high throughput for more transactions before the l/O overhead dominates.
- [84] arXiv:2406.13113 [pdf, html, other]
-
Title: CU-Net: a U-Net architecture for efficient brain-tumor segmentation on BraTS 2019 datasetSubjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Neurons and Cognition (q-bio.NC)
Accurately segmenting brain tumors from MRI scans is important for developing effective treatment plans and improving patient outcomes. This study introduces a new implementation of the Columbia-University-Net (CU-Net) architecture for brain tumor segmentation using the BraTS 2019 dataset. The CU-Net model has a symmetrical U-shaped structure and uses convolutional layers, max pooling, and upsampling operations to achieve high-resolution segmentation. Our CU-Net model achieved a Dice score of 82.41%, surpassing two other state-of-the-art models. This improvement in segmentation accuracy highlights the robustness and effectiveness of the model, which helps to accurately delineate tumor boundaries, which is crucial for surgical planning and radiation therapy, and ultimately has the potential to improve patient outcomes.
- [85] arXiv:2406.13114 [pdf, html, other]
-
Title: Multi-Stage Balanced Distillation: Addressing Long-Tail Challenges in Sequence-Level Knowledge DistillationComments: preprintSubjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Large language models (LLMs) have significantly advanced various natural language processing tasks, but deploying them remains computationally expensive. Knowledge distillation (KD) is a promising solution, enabling the transfer of capabilities from larger teacher LLMs to more compact student models. Particularly, sequence-level KD, which distills rationale-based reasoning processes instead of merely final outcomes, shows great potential in enhancing students' reasoning capabilities. However, current methods struggle with sequence level KD under long-tailed data distributions, adversely affecting generalization on sparsely represented domains. We introduce the Multi-Stage Balanced Distillation (BalDistill) framework, which iteratively balances training data within a fixed computational budget. By dynamically selecting representative head domain examples and synthesizing tail domain examples, BalDistill achieves state-of-the-art performance across diverse long-tailed datasets, enhancing both the efficiency and efficacy of the distilled models.
- [86] arXiv:2406.13115 [pdf, html, other]
-
Title: Dynamic Walking on Highly Underactuated Point Foot Humanoids: Closing the Loop between HZD and HLIPComments: 9 pages, 10 figuresSubjects: Robotics (cs.RO)
Realizing bipedal locomotion on humanoid robots with point feet is especially challenging due to their highly underactuated nature, high degrees of freedom, and hybrid dynamics resulting from impacts. With the goal of addressing this challenging problem, this paper develops a control framework for realizing dynamic locomotion and implements it on a novel point foot humanoid: ADAM. To this end, we close the loop between Hybrid Zero Dynamics (HZD) and Hybrid linear inverted pendulum (HLIP) based step length regulation. To leverage the full-order hybrid dynamics of the robot, walking gaits are first generated offline by utilizing HZD. These trajectories are stabilized online through the use of a HLIP based regulator. Finally, the planned trajectories are mapped into the full-order system using a task space controller incorporating inverse kinematics. The proposed method is verified through numerical simulations and hardware experiments on the humanoid robot ADAM marking the first humanoid point foot walking. Moreover, we experimentally demonstrate the robustness of the realized walking via the ability to track a desired reference speed, robustness to pushes, and locomotion on uneven terrain.
- [87] arXiv:2406.13116 [pdf, html, other]
-
Title: A Lower Bound on Swap Regret in Extensive-Form GamesSubjects: Computer Science and Game Theory (cs.GT)
Recent simultaneous works by Peng and Rubinstein [2024] and Dagan et al. [2024] have demonstrated the existence of a no-swap-regret learning algorithm that can reach $\epsilon$ average swap regret against an adversary in any extensive-form game within $m^{\tilde{\mathcal O}(1/\epsilon)}$ rounds, where $m$ is the number of nodes in the game tree. However, the question of whether a $\mathrm{poly}(m, 1/\epsilon)$-round algorithm could exist remained open. In this paper, we show a lower bound that precludes the existence of such an algorithm. In particular, we show that achieving average swap regret $\epsilon$ against an oblivious adversary in general extensive-form games requires at least $\mathrm{exp}\left(\Omega\left(\min\left\{m^{1/14}, \epsilon^{-1/6}\right\}\right)\right)$ rounds.
- [88] arXiv:2406.13117 [pdf, html, other]
-
Title: State-of-the-Art Review: The Use of Digital Twins to Support Artificial Intelligence-Guided Predictive MaintenanceComments: This work has been submitted to Springer for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessibleSubjects: Artificial Intelligence (cs.AI)
In recent years, predictive maintenance (PMx) has gained prominence for its potential to enhance efficiency, automation, accuracy, and cost-effectiveness while reducing human involvement. Importantly, PMx has evolved in tandem with digital advancements, such as Big Data and the Internet of Things (IOT). These technological strides have enabled Artificial Intelligence (AI) to revolutionize PMx processes, with increasing capacities for real-time automation of monitoring, analysis, and prediction tasks. However, PMx still faces challenges such as poor explainability and sample inefficiency in data-driven methods and high complexity in physics-based models, hindering broader adoption. This paper posits that Digital Twins (DTs) can be integrated into PMx to overcome these challenges, paving the way for more automated PMx applications across various stakeholders. Despite their potential, current DTs have not fully matured to bridge existing gaps. Our paper provides a comprehensive roadmap for DT evolution, addressing current limitations to foster large-scale automated PMx progression. We structure our approach in three stages: First, we reference prior work where we identified and defined the Information Requirements (IRs) and Functional Requirements (FRs) for PMx, forming the blueprint for a unified framework. Second, we conduct a literature review to assess current DT applications integrating these IRs and FRs, revealing standardized DT models and tools that support automated PMx. Lastly, we highlight gaps in current DT implementations, particularly those IRs and FRs not fully supported, and outline the necessary components for a comprehensive, automated PMx system. Our paper concludes with research directions aimed at seamlessly integrating DTs into the PMx paradigm to achieve this ambitious vision.
- [89] arXiv:2406.13118 [pdf, html, other]
-
Title: Thruster-Assisted Incline WalkingKaushik Venkatesh Krishnamurthy, Chenghao Wang, Shreyansh Pitroda, Adarsh Salagame, Eric Sihite, Reza Nemovi, Alireza Ramezani, Morteza GharibComments: 7 pages, 7 figures, submitted to CDC 2024 conference. arXiv admin note: text overlap with arXiv:2405.06070Subjects: Robotics (cs.RO); Systems and Control (eess.SY)
In this study, our aim is to evaluate the effectiveness of thruster-assisted steep slope walking for the Husky Carbon, a quadrupedal robot equipped with custom-designed actuators and plural electric ducted fans, through simulation prior to conducting experimental trials. Thruster-assisted steep slope walking draws inspiration from wing-assisted incline running (WAIR) observed in birds, and intriguingly incorporates posture manipulation and thrust vectoring, a locomotion technique not previously explored in the animal kingdom. Our approach involves developing a reduced-order model of the Husky robot, followed by the application of an optimization-based controller utilizing collocation methods and dynamics interpolation to determine control actions. Through simulation testing, we demonstrate the feasibility of hardware implementation of our controller.
- [90] arXiv:2406.13119 [pdf, html, other]
-
Title: GbHammer: Malicious Inter-process Page Sharing by Hammering Global Bits in Page Table EntriesComments: Presented in Fourth Workshop on DRAM Security (DRAMSec), June 29, 2024Subjects: Cryptography and Security (cs.CR)
RowHammer is a vulnerability inside DRAM chips where an attacker repeatedly accesses a DRAM row to flip bits in the nearby rows without directly accessing them. Several studies have found that flipping bits in the address part inside a page table entry (PTE) leads to serious security risks such as privilege escalation. However, the risk of management bits in a PTE being flipped by RowHammer has not yet been discussed as far as we know. In this paper, we point out a new vulnerability called GbHammer that allows an attacker to maliciously share a physical memory page with a victim by hammering the global bit in a PTE. GbHammer not only creates a shared page but also enables the attacker to (1) make the victim's process execute arbitrary binary and (2) snoop on the victim's secret data through the shared page. We demonstrate the two exploits on a real Linux kernel running on a cycle-accurate CPU simulator. We also discuss possible mitigation measures for GbHammer and the risk of GbHammer in non-x86 ISAs.
- [91] arXiv:2406.13121 [pdf, html, other]
-
Title: Can Long-Context Language Models Subsume Retrieval, RAG, SQL, and More?Jinhyuk Lee, Anthony Chen, Zhuyun Dai, Dheeru Dua, Devendra Singh Sachan, Michael Boratko, Yi Luan, Sébastien M. R. Arnold, Vincent Perot, Siddharth Dalmia, Hexiang Hu, Xudong Lin, Panupong Pasupat, Aida Amini, Jeremy R. Cole, Sebastian Riedel, Iftekhar Naim, Ming-Wei Chang, Kelvin GuuComments: 29 pages. Dataset available at this https URLSubjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Long-context language models (LCLMs) have the potential to revolutionize our approach to tasks traditionally reliant on external tools like retrieval systems or databases. Leveraging LCLMs' ability to natively ingest and process entire corpora of information offers numerous advantages. It enhances user-friendliness by eliminating the need for specialized knowledge of tools, provides robust end-to-end modeling that minimizes cascading errors in complex pipelines, and allows for the application of sophisticated prompting techniques across the entire system. To assess this paradigm shift, we introduce LOFT, a benchmark of real-world tasks requiring context up to millions of tokens designed to evaluate LCLMs' performance on in-context retrieval and reasoning. Our findings reveal LCLMs' surprising ability to rival state-of-the-art retrieval and RAG systems, despite never having been explicitly trained for these tasks. However, LCLMs still face challenges in areas like compositional reasoning that are required in SQL-like tasks. Notably, prompting strategies significantly influence performance, emphasizing the need for continued research as context lengths grow. Overall, LOFT provides a rigorous testing ground for LCLMs, showcasing their potential to supplant existing paradigms and tackle novel tasks as model capabilities scale.
- [92] arXiv:2406.13123 [pdf, html, other]
-
Title: ViLCo-Bench: VIdeo Language COntinual learning BenchmarkComments: 14 pages, 4 figures, 8 tables, under reviewSubjects: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Video language continual learning involves continuously adapting to information from video and text inputs, enhancing a model's ability to handle new tasks while retaining prior knowledge. This field is a relatively under-explored area, and establishing appropriate datasets is crucial for facilitating communication and research in this field. In this study, we present the first dedicated benchmark, ViLCo-Bench, designed to evaluate continual learning models across a range of video-text tasks. The dataset comprises ten-minute-long videos and corresponding language queries collected from publicly available datasets. Additionally, we introduce a novel memory-efficient framework that incorporates self-supervised learning and mimics long-term and short-term memory effects. This framework addresses challenges including memory complexity from long video clips, natural language complexity from open queries, and text-video misalignment. We posit that ViLCo-Bench, with greater complexity compared to existing continual learning benchmarks, would serve as a critical tool for exploring the video-language domain, extending beyond conventional class-incremental tasks, and addressing complex and limited annotation issues. The curated data, evaluations, and our novel method are available at this https URL .
- [93] arXiv:2406.13124 [pdf, html, other]
-
Title: Learning to Generate Answers with Citations via Factual Consistency ModelsComments: Accepted to ACL 2024. Code release will followSubjects: Computation and Language (cs.CL)
Large Language Models (LLMs) frequently hallucinate, impeding their reliability in mission-critical situations. One approach to address this issue is to provide citations to relevant sources alongside generated content, enhancing the verifiability of generations. However, citing passages accurately in answers remains a substantial challenge. This paper proposes a weakly-supervised fine-tuning method leveraging factual consistency models (FCMs). Our approach alternates between generating texts with citations and supervised fine-tuning with FCM-filtered citation data. Focused learning is integrated into the objective, directing the fine-tuning process to emphasise the factual unit tokens, as measured by an FCM. Results on the ALCE few-shot citation benchmark with various instruction-tuned LLMs demonstrate superior performance compared to in-context learning, vanilla supervised fine-tuning, and state-of-the-art methods, with an average improvement of $34.1$, $15.5$, and $10.5$ citation F$_1$ points, respectively. Moreover, in a domain transfer setting we show that the obtained citation generation ability robustly transfers to unseen datasets. Notably, our citation improvements contribute to the lowest factual error rate across baselines.
- [94] arXiv:2406.13125 [pdf, html, other]
-
Title: A Unified Framework for Combinatorial Optimization Based on Graph Neural NetworksSubjects: Artificial Intelligence (cs.AI)
Graph neural networks (GNNs) have emerged as a powerful tool for solving combinatorial optimization problems (COPs), exhibiting state-of-the-art performance in both graph-structured and non-graph-structured domains. However, existing approaches lack a unified framework capable of addressing a wide range of COPs. After presenting a summary of representative COPs and a brief review of recent advancements in GNNs for solving COPs, this paper proposes a unified framework for solving COPs based on GNNs, including graph representation of COPs, equivalent conversion of non-graph structured COPs to graph-structured COPs, graph decomposition, and graph simplification. The proposed framework leverages the ability of GNNs to effectively capture the relational information and extract features from the graph representation of COPs, offering a generic solution to COPs that can address the limitations of state-of-the-art in solving non-graph-structured and highly complex graph-structured COPs.
- [95] arXiv:2406.13126 [pdf, html, other]
-
Title: Guided Context Gating: Learning to leverage salient lesions in retinal fundus imagesComments: This paper has been accepted for presentation at the IEEE International Conference on Image Processing (ICIP 2024)Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Effectively representing medical images, especially retinal images, presents a considerable challenge due to variations in appearance, size, and contextual information of pathological signs called lesions. Precise discrimination of these lesions is crucial for diagnosing vision-threatening issues such as diabetic retinopathy. While visual attention-based neural networks have been introduced to learn spatial context and channel correlations from retinal images, they often fall short in capturing localized lesion context. Addressing this limitation, we propose a novel attention mechanism called Guided Context Gating, an unique approach that integrates Context Formulation, Channel Correlation, and Guided Gating to learn global context, spatial correlations, and localized lesion context. Our qualitative evaluation against existing attention mechanisms emphasize the superiority of Guided Context Gating in terms of explainability. Notably, experiments on the Zenodo-DR-7 dataset reveal a substantial 2.63% accuracy boost over advanced attention mechanisms & an impressive 6.53% improvement over the state-of-the-art Vision Transformer for assessing the severity grade of retinopathy, even with imbalanced and limited training samples for each class.
- [96] arXiv:2406.13127 [pdf, html, other]
-
Title: Oralytics Reinforcement Learning AlgorithmAnna L. Trella, Kelly W. Zhang, Stephanie M. Carpenter, David Elashoff, Zara M. Greer, Inbal Nahum-Shani, Dennis Ruenger, Vivek Shetty, Susan A. MurphySubjects: Artificial Intelligence (cs.AI)
Dental disease is still one of the most common chronic diseases in the United States. While dental disease is preventable through healthy oral self-care behaviors (OSCB), this basic behavior is not consistently practiced. We have developed Oralytics, an online, reinforcement learning (RL) algorithm that optimizes the delivery of personalized intervention prompts to improve OSCB. In this paper, we offer a full overview of algorithm design decisions made using prior data, domain expertise, and experiments in a simulation test bed. The finalized RL algorithm was deployed in the Oralytics clinical trial, conducted from fall 2023 to summer 2024.
- [97] arXiv:2406.13128 [pdf, html, other]
-
Title: A New Approach for Evaluating and Improving the Performance of Segmentation Algorithms on Hard-to-Detect Blood VesselsSubjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Many studies regarding the vasculature of biological tissues involve the segmentation of the blood vessels in a sample followed by the creation of a graph structure to model the vasculature. The graph is then used to extract relevant vascular properties. Small segmentation errors can lead to largely distinct connectivity patterns and a high degree of variability of the extracted properties. Nevertheless, global metrics such as Dice, precision, and recall are commonly applied for measuring the performance of blood vessel segmentation algorithms. These metrics might conceal important information about the accuracy at specific regions of a sample. To tackle this issue, we propose a local vessel salience (LVS) index to quantify the expected difficulty in segmenting specific blood vessel segments. The LVS index is calculated for each vessel pixel by comparing the local intensity of the vessel with the image background around the pixel. The index is then used for defining a new accuracy metric called low-salience recall (LSRecall), which quantifies the performance of segmentation algorithms on blood vessel segments having low salience. The perspective provided by the LVS index is used to define a data augmentation procedure that can be used to improve the segmentation performance of convolutional neural networks. We show that segmentation algorithms having high Dice and recall values can display very low LSRecall values, which reveals systematic errors of these algorithms for vessels having low salience. The proposed data augmentation procedure is able to improve the LSRecall of some samples by as much as 25%. The developed methodology opens up new possibilities for comparing the performance of segmentation algorithms regarding hard-to-detect blood vessels as well as their capabilities for vascular topology preservation.
- [98] arXiv:2406.13129 [pdf, html, other]
-
Title: M3T: Multi-Modal Medical Transformer to bridge Clinical Context with Visual Insights for Retinal Image Medical Description GenerationComments: This paper has been accepted for presentation at the IEEE International Conference on Image Processing (ICIP 2024)Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Automated retinal image medical description generation is crucial for streamlining medical diagnosis and treatment planning. Existing challenges include the reliance on learned retinal image representations, difficulties in handling multiple imaging modalities, and the lack of clinical context in visual representations. Addressing these issues, we propose the Multi-Modal Medical Transformer (M3T), a novel deep learning architecture that integrates visual representations with diagnostic keywords. Unlike previous studies focusing on specific aspects, our approach efficiently learns contextual information and semantics from both modalities, enabling the generation of precise and coherent medical descriptions for retinal images. Experimental studies on the DeepEyeNet dataset validate the success of M3T in meeting ophthalmologists' standards, demonstrating a substantial 13.5% improvement in BLEU@4 over the best-performing baseline model.
- [99] arXiv:2406.13130 [pdf, html, other]
-
Title: Advancing Retail Data Science: Comprehensive Evaluation of Synthetic DataSubjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
The evaluation of synthetic data generation is crucial, especially in the retail sector where data accuracy is paramount. This paper introduces a comprehensive framework for assessing synthetic retail data, focusing on fidelity, utility, and privacy. Our approach differentiates between continuous and discrete data attributes, providing precise evaluation criteria. Fidelity is measured through stability and generalizability. Stability ensures synthetic data accurately replicates known data distributions, while generalizability confirms its robustness in novel scenarios. Utility is demonstrated through the synthetic data's effectiveness in critical retail tasks such as demand forecasting and dynamic pricing, proving its value in predictive analytics and strategic planning. Privacy is safeguarded using Differential Privacy, ensuring synthetic data maintains a perfect balance between resembling training and holdout datasets without compromising security. Our findings validate that this framework provides reliable and scalable evaluation for synthetic retail data. It ensures high fidelity, utility, and privacy, making it an essential tool for advancing retail data science. This framework meets the evolving needs of the retail industry with precision and confidence, paving the way for future advancements in synthetic data methodologies.
- [100] arXiv:2406.13131 [pdf, html, other]
-
Title: When Parts are Greater Than Sums: Individual LLM Components Can Outperform Full ModelsSubjects: Computation and Language (cs.CL)
This paper studies in-context learning (ICL) by decomposing the output of large language models into the individual contributions of attention heads and MLPs (components). We observe curious components: good-performing ones that individually do well on a classification task, even when the model performs poorly; bad-performing ones that do much worse than chance; and label-biased components that always predict the same label. We find that component accuracies are well-correlated across different demonstration sets and perturbations of prompt templates, even when the full-model accuracy varies greatly. Based on our findings, we propose component reweighting, which learns to linearly re-scale the component activations from a few labeled examples. Given 24 labeled examples, our method improves by an average of 6.0% accuracy points over 24-shot ICL across 8 tasks on Llama-2-7B. Overall, this paper both enriches our understanding of ICL and provides a practical method for improvement by examining model internals.
- [101] arXiv:2406.13133 [pdf, html, other]
-
Title: PathoLM: Identifying pathogenicity from the DNA sequence through the Genome Foundation ModelSajib Acharjee Dip, Uddip Acharjee Shuvo, Tran Chau, Haoqiu Song, Petra Choi, Xuan Wang, Liqing ZhangComments: 9 pages, 3 figuresSubjects: Computation and Language (cs.CL); Machine Learning (cs.LG); Genomics (q-bio.GN)
Pathogen identification is pivotal in diagnosing, treating, and preventing diseases, crucial for controlling infections and safeguarding public health. Traditional alignment-based methods, though widely used, are computationally intense and reliant on extensive reference databases, often failing to detect novel pathogens due to their low sensitivity and specificity. Similarly, conventional machine learning techniques, while promising, require large annotated datasets and extensive feature engineering and are prone to overfitting. Addressing these challenges, we introduce PathoLM, a cutting-edge pathogen language model optimized for the identification of pathogenicity in bacterial and viral sequences. Leveraging the strengths of pre-trained DNA models such as the Nucleotide Transformer, PathoLM requires minimal data for fine-tuning, thereby enhancing pathogen detection capabilities. It effectively captures a broader genomic context, significantly improving the identification of novel and divergent pathogens. We developed a comprehensive data set comprising approximately 30 species of viruses and bacteria, including ESKAPEE pathogens, seven notably virulent bacterial strains resistant to antibiotics. Additionally, we curated a species classification dataset centered specifically on the ESKAPEE group. In comparative assessments, PathoLM dramatically outperforms existing models like DciPatho, demonstrating robust zero-shot and few-shot capabilities. Furthermore, we expanded PathoLM-Sp for ESKAPEE species classification, where it showed superior performance compared to other advanced deep learning methods, despite the complexities of the task.
- [102] arXiv:2406.13136 [pdf, html, other]
-
Title: GVT2RPM: An Empirical Study for General Video Transformer Adaptation to Remote Physiological MeasurementComments: Code is available at this https URLSubjects: Computer Vision and Pattern Recognition (cs.CV)
Remote physiological measurement (RPM) is an essential tool for healthcare monitoring as it enables the measurement of physiological signs, e.g., heart rate, in a remote setting via physical wearables. Recently, with facial videos, we have seen rapid advancements in video-based RPMs. However, adopting facial videos for RPM in the clinical setting largely depends on the accuracy and robustness (work across patient populations). Fortunately, the capability of the state-of-the-art transformer architecture in general (natural) video understanding has resulted in marked improvements and has been translated to facial understanding, including RPM. However, existing RPM methods usually need RPM-specific modules, e.g., temporal difference convolution and handcrafted feature maps. Although these customized modules can increase accuracy, they are not demonstrated for their robustness across datasets. Further, due to their customization of the transformer architecture, they cannot use the advancements made in general video transformers (GVT). In this study, we interrogate the GVT architecture and empirically analyze how the training designs, i.e., data pre-processing and network configurations, affect the model performance applied to RPM. Based on the structure of video transformers, we propose to configure its spatiotemporal hierarchy to align with the dense temporal information needed in RPM for signal feature extraction. We define several practical guidelines and gradually adapt GVTs for RPM without introducing RPM-specific modules. Our experiments demonstrate favorable results to existing RPM-specific module counterparts. We conducted extensive experiments with five datasets using intra-dataset and cross-dataset settings. We highlight that the proposed guidelines GVT2RPM can be generalized to any video transformers and is robust to various datasets.
- [103] arXiv:2406.13137 [pdf, html, other]
-
Title: Efficient Sharpness-Aware Minimization for Molecular Graph Transformer ModelsSubjects: Machine Learning (cs.LG)
Sharpness-aware minimization (SAM) has received increasing attention in computer vision since it can effectively eliminate the sharp local minima from the training trajectory and mitigate generalization degradation. However, SAM requires two sequential gradient computations during the optimization of each step: one to obtain the perturbation gradient and the other to obtain the updating gradient. Compared with the base optimizer (e.g., Adam), SAM doubles the time overhead due to the additional perturbation gradient. By dissecting the theory of SAM and observing the training gradient of the molecular graph transformer, we propose a new algorithm named GraphSAM, which reduces the training cost of SAM and improves the generalization performance of graph transformer models. There are two key factors that contribute to this result: (i) \textit{gradient approximation}: we use the updating gradient of the previous step to approximate the perturbation gradient at the intermediate steps smoothly (\textbf{increases efficiency}); (ii) \textit{loss landscape approximation}: we theoretically prove that the loss landscape of GraphSAM is limited to a small range centered on the expected loss of SAM (\textbf{guarantees generalization performance}). The extensive experiments on six datasets with different tasks demonstrate the superiority of GraphSAM, especially in optimizing the model update process. The code is in:this https URL
- [104] arXiv:2406.13138 [pdf, html, other]
-
Title: Large Language Models are Biased Because They Are Large Language ModelsComments: Under review, 15 pagesSubjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
This paper's primary goal is to provoke thoughtful discussion about the relationship between bias and fundamental properties of large language models. We do this by seeking to convince the reader that harmful biases are an inevitable consequence arising from the design of any large language model as LLMs are currently formulated. To the extent that this is true, it suggests that the problem of harmful bias cannot be properly addressed without a serious reconsideration of AI driven by LLMs, going back to the foundational assumptions underlying their design.
- [105] arXiv:2406.13140 [pdf, html, other]
-
Title: From decision aiding to the massive use of algorithms: where does the responsibility stand?Subjects: Computers and Society (cs.CY)
In the very large debates on ethics of algorithms, this paper proposes an analysis on human responsibility. On one hand, algorithms are designed by some humans, who bear a part of responsibility in the results and unexpected impacts. Nevertheless, we show how the fact they cannot embrace the full situations of use and consequences lead to an unreachable limit. On the other hand, using technology is never free of responsibility, even if there also exist limits to characterise. Massive uses by unprofessional users introduce additional questions that modify the possibilities to be ethically responsible. The article is structured in such a way as to show how the limits have gradually evolved, leaving unthought of issues and a failure to share responsibility.
- [106] arXiv:2406.13144 [pdf, html, other]
-
Title: DialSim: A Real-Time Simulator for Evaluating Long-Term Dialogue Understanding of Conversational AgentsJiho Kim, Woosog Chay, Hyeonji Hwang, Daeun Kyung, Hyunseung Chung, Eunbyeol Cho, Yohan Jo, Edward ChoiSubjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Recent advancements in Large Language Models (LLMs) have significantly enhanced the capabilities of conversational agents, making them applicable to various fields (e.g., education). Despite their progress, the evaluation of the agents often overlooks the complexities of real-world conversations, such as real-time interactions, multi-party dialogues, and extended contextual dependencies. To bridge this gap, we introduce DialSim, a real-time dialogue simulator. In this simulator, an agent is assigned the role of a character from popular TV shows, requiring it to respond to spontaneous questions using past dialogue information and to distinguish between known and unknown information. Key features of DialSim include evaluating the agent's ability to respond within a reasonable time limit, handling long-term multi-party dialogues, and managing adversarial settings (e.g., swap character names) to challenge the agent's reliance on pre-trained knowledge. We utilized this simulator to evaluate the latest conversational agents and analyze their limitations. Our experiments highlight both the strengths and weaknesses of these agents, providing valuable insights for future improvements in the field of conversational AI. DialSim is available at this https URL.
- [107] arXiv:2406.13145 [pdf, html, other]
-
Title: Constructing and Evaluating Digital Twins: An Intelligent Framework for DT DevelopmentSubjects: Systems and Control (eess.SY); Machine Learning (cs.LG)
The development of Digital Twins (DTs) represents a transformative advance for simulating and optimizing complex systems in a controlled digital space. Despite their potential, the challenge of constructing DTs that accurately replicate and predict the dynamics of real-world systems remains substantial. This paper introduces an intelligent framework for the construction and evaluation of DTs, specifically designed to enhance the accuracy and utility of DTs in testing algorithmic performance. We propose a novel construction methodology that integrates deep learning-based policy gradient techniques to dynamically tune the DT parameters, ensuring high fidelity in the digital replication of physical systems. Moreover, the Mean STate Error (MSTE) is proposed as a robust metric for evaluating the performance of algorithms within these digital space. The efficacy of our framework is demonstrated through extensive simulations that show our DT not only accurately mirrors the physical reality but also provides a reliable platform for algorithm evaluation. This work lays a foundation for future research into DT technologies, highlighting pathways for both theoretical enhancements and practical implementations in various industries.
- [108] arXiv:2406.13147 [pdf, html, other]
-
Title: A Simulation Environment for the Neuroevolution of Ant Colony DynamicsComments: Accepted for publication at The 2024 Conference on Artificial Life. 2 page extended abstractSubjects: Multiagent Systems (cs.MA); Neural and Evolutionary Computing (cs.NE); Adaptation and Self-Organizing Systems (nlin.AO)
We introduce a simulation environment to facilitate research into emergent collective behaviour, with a focus on replicating the dynamics of ant colonies. By leveraging real-world data, the environment simulates a target ant trail that a controllable agent must learn to replicate, using sensory data observed by the target ant. This work aims to contribute to the neuroevolution of models for collective behaviour, focusing on evolving neural architectures that encode domain-specific behaviours in the network topology. By evolving models that can be modified and studied in a controlled environment, we can uncover the necessary conditions required for collective behaviours to emerge. We hope this environment will be useful to those studying the role of interactions in emergent behaviour within collective systems.
- [109] arXiv:2406.13149 [pdf, html, other]
-
Title: High-Fidelity Facial Albedo Estimation via Texture QuantizationZimin Ran, Xingyu Ren, Xiang An, Kaicheng Yang, Xiangzi Dai, Ziyong Feng, Jia Guo, Linchao Zhu, Jiankang DengSubjects: Computer Vision and Pattern Recognition (cs.CV)
Recent 3D face reconstruction methods have made significant progress in shape estimation, but high-fidelity facial albedo reconstruction remains challenging. Existing methods depend on expensive light-stage captured data to learn facial albedo maps. However, a lack of diversity in subjects limits their ability to recover high-fidelity results. In this paper, we present a novel facial albedo reconstruction model, HiFiAlbedo, which recovers the albedo map directly from a single image without the need for captured albedo data. Our key insight is that the albedo map is the illumination invariant texture map, which enables us to use inexpensive texture data to derive an albedo estimation by eliminating illumination. To achieve this, we first collect large-scale ultra-high-resolution facial images and train a high-fidelity facial texture codebook. By using the FFHQ dataset and limited UV textures, we then fine-tune the encoder for texture reconstruction from the input image with adversarial supervision in both image and UV space. Finally, we train a cross-attention module and utilize group identity loss to learn the adaptation from facial texture to the albedo domain. Extensive experimentation has demonstrated that our method exhibits excellent generalizability and is capable of achieving high-fidelity results for in-the-wild facial albedo recovery. Our code, pre-trained weights, and training data will be made publicly available at this https URL.
- [110] arXiv:2406.13152 [pdf, html, other]
-
Title: Analyzing Diversity in Healthcare LLM Research: A Scientometric PerspectiveSubjects: Computation and Language (cs.CL)
The deployment of large language models (LLMs) in healthcare has demonstrated substantial potential for enhancing clinical decision-making, administrative efficiency, and patient outcomes. However, the underrepresentation of diverse groups in the development and application of these models can perpetuate biases, leading to inequitable healthcare delivery. This paper presents a comprehensive scientometric analysis of LLM research for healthcare, including data from January 1, 2021, to June 16, 2024. By analyzing metadata from PubMed and Dimensions, including author affiliations, countries, and funding sources, we assess the diversity of contributors to LLM research. Our findings highlight significant gender and geographic disparities, with a predominance of male authors and contributions primarily from high-income countries (HICs). We introduce a novel journal diversity index based on Gini impurity to measure the inclusiveness of scientific publications. Our results underscore the necessity for greater representation in order to ensure the equitable application of LLMs in healthcare. We propose actionable strategies to enhance diversity and inclusivity in artificial intelligence research, with the ultimate goal of fostering a more inclusive and equitable future in healthcare innovation.
- [111] arXiv:2406.13153 [pdf, html, other]
-
Title: SwinStyleformer is a favorable choice for image inversionSubjects: Computer Vision and Pattern Recognition (cs.CV)
This paper proposes the first pure Transformer structure inversion network called SwinStyleformer, which can compensate for the shortcomings of the CNNs inversion framework by handling long-range dependencies and learning the global structure of objects. Experiments found that the inversion network with the Transformer backbone could not successfully invert the image. The above phenomena arise from the differences between CNNs and Transformers, such as the self-attention weights favoring image structure ignoring image details compared to convolution, the lack of multi-scale properties of Transformer, and the distribution differences between the latent code extracted by the Transformer and the StyleGAN style vector. To address these differences, we employ the Swin Transformer with a smaller window size as the backbone of the SwinStyleformer to enhance the local detail of the inversion image. Meanwhile, we design a Transformer block based on learnable queries. Compared to the self-attention transformer block, the Transformer block based on learnable queries provides greater adaptability and flexibility, enabling the model to update the attention weights according to specific tasks. Thus, the inversion focus is not limited to the image structure. To further introduce multi-scale properties, we design multi-scale connections in the extraction of feature maps. Multi-scale connections allow the model to gain a comprehensive understanding of the image to avoid loss of detail due to global modeling. Moreover, we propose an inversion discriminator and distribution alignment loss to minimize the distribution differences. Based on the above designs, our SwinStyleformer successfully solves the Transformer's inversion failure issue and demonstrates SOTA performance in image inversion and several related vision tasks.
- [112] arXiv:2406.13155 [pdf, html, other]
-
Title: Convolutional Kolmogorov-Arnold NetworksSubjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
In this paper, we introduce the Convolutional Kolmogorov-Arnold Networks (Convolutional KANs), an innovative alternative to the standard Convolutional Neural Networks (CNNs) that have revolutionized the field of computer vision. We integrate the non-linear activation functions presented in Kolmogorov-Arnold Networks (KANs) into convolutions to build a new layer. Throughout the paper, we empirically validate the performance of Convolutional KANs against traditional architectures across MNIST and Fashion-MNIST benchmarks, illustrating that this new approach maintains a similar level of accuracy while using half the amount of parameters. This significant reduction of parameters opens up a new approach to advance the optimization of neural network architectures.
- [113] arXiv:2406.13158 [pdf, html, other]
-
Title: Utility Pole Fire Risk Inspection from 2D Street-Side ImagesSubjects: Computer Vision and Pattern Recognition (cs.CV)
In recent years, California's electrical grid has confronted mounting challenges stemming from aging infrastructure and a landscape increasingly susceptible to wildfires. This paper presents a comprehensive framework utilizing computer vision techniques to address wildfire risk within the state's electrical grid, with a particular focus on vulnerable utility poles. These poles are susceptible to fire outbreaks or structural failure during extreme weather events. The proposed pipeline harnesses readily available Google Street View imagery to identify utility poles and assess their proximity to surrounding vegetation, as well as to determine any inclination angles. The early detection of potential risks associated with utility poles is pivotal for forestalling wildfire ignitions and informing strategic investments, such as undergrounding vulnerable poles and powerlines. Moreover, this study underscores the significance of data-driven decision-making in bolstering grid resilience, particularly concerning Public Safety Power Shutoffs. By fostering collaboration among utilities, policymakers, and researchers, this pipeline aims to solidify the electric grid's resilience and safeguard communities against the escalating threat of wildfires.
- [114] arXiv:2406.13161 [pdf, other]
-
Title: APPL: A Prompt Programming Language for Harmonious Integration of Programs and Large Language Model PromptsHonghua Dong, Qidong Su, Yubo Gao, Zhaoyu Li, Yangjun Ruan, Gennady Pekhimenko, Chris J. Maddison, Xujie SiSubjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Programming Languages (cs.PL)
Large Language Models (LLMs) have become increasingly capable of handling diverse tasks with the aid of well-crafted prompts and integration of external tools, but as task complexity rises, the workflow involving LLMs can be complicated and thus challenging to implement and maintain. To address this challenge, we propose APPL, A Prompt Programming Language that acts as a bridge between computer programs and LLMs, allowing seamless embedding of prompts into Python functions, and vice versa. APPL provides an intuitive and Python-native syntax, an efficient parallelized runtime with asynchronous semantics, and a tracing module supporting effective failure diagnosis and replaying without extra costs. We demonstrate that APPL programs are intuitive, concise, and efficient through three representative scenarios: Chain-of-Thought with self-consistency (CoT-SC), ReAct tool use agent, and multi-agent chat. Experiments on three parallelizable workflows further show that APPL can effectively parallelize independent LLM calls, with a significant speedup ratio that almost matches the estimation.
- [115] arXiv:2406.13162 [pdf, html, other]
-
Title: AntibodyFlow: Normalizing Flow Model for Designing Antibody Complementarity-Determining RegionsSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Quantitative Methods (q-bio.QM)
Therapeutic antibodies have been extensively studied in drug discovery and development in the past decades. Antibodies are specialized protective proteins that bind to antigens in a lock-to-key manner. The binding strength/affinity between an antibody and a specific antigen is heavily determined by the complementarity-determining regions (CDRs) on the antibodies. Existing machine learning methods cast in silico development of CDRs as either sequence or 3D graph (with a single chain) generation tasks and have achieved initial success. However, with CDR loops having specific geometry shapes, learning the 3D geometric structures of CDRs remains a challenge. To address this issue, we propose AntibodyFlow, a 3D flow model to design antibody CDR loops. Specifically, AntibodyFlow first constructs the distance matrix, then predicts amino acids conditioned on the distance matrix. Also, AntibodyFlow conducts constraint learning and constrained generation to ensure valid 3D structures. Experimental results indicate that AntibodyFlow outperforms the best baseline consistently with up to 16.0% relative improvement in validity rate and 24.3% relative reduction in geometric graph level error (root mean square deviation, RMSD).
- [116] arXiv:2406.13166 [pdf, other]
-
Title: Enhancing supply chain security with automated machine learningComments: 22 pagesSubjects: Machine Learning (cs.LG); General Economics (econ.GN); Optimization and Control (math.OC)
This study tackles the complexities of global supply chains, which are increasingly vulnerable to disruptions caused by port congestion, material shortages, and inflation. To address these challenges, we explore the application of machine learning methods, which excel in predicting and optimizing solutions based on large datasets. Our focus is on enhancing supply chain security through fraud detection, maintenance prediction, and material backorder forecasting. We introduce an automated machine learning framework that streamlines data analysis, model construction, and hyperparameter optimization for these tasks. By automating these processes, our framework improves the efficiency and effectiveness of supply chain security measures. Our research identifies key factors that influence machine learning performance, including sampling methods, categorical encoding, feature selection, and hyperparameter optimization. We demonstrate the importance of considering these factors when applying machine learning to supply chain challenges. Traditional mathematical programming models often struggle to cope with the complexity of large-scale supply chain problems. Our study shows that machine learning methods can provide a viable alternative, particularly when dealing with extensive datasets and complex patterns. The automated machine learning framework presented in this study offers a novel approach to supply chain security, contributing to the existing body of knowledge in the field. Its comprehensive automation of machine learning processes makes it a valuable contribution to the domain of supply chain management.
- [117] arXiv:2406.13167 [pdf, html, other]
-
Title: QRMeM: Unleash the Length Limitation through Question then Reflection Memory MechanismSubjects: Computation and Language (cs.CL)
While large language models (LLMs) have made notable advancements in natural language processing, they continue to struggle with processing extensive text. Memory mechanism offers a flexible solution for managing long contexts, utilizing techniques such as compression, summarization, and structuring to facilitate nuanced and efficient handling of large volumes of text. However, existing techniques face challenges with static knowledge integration, leading to insufficient adaptation to task-specific needs and missing multi-segmentation relationships, which hinders the dynamic reorganization and logical combination of relevant segments during the response process. To address these issues, we introduce a novel strategy, Question then Reflection Memory Mechanism (QRMeM), incorporating a dual-structured memory pool. This pool synergizes static textual content with structured graph guidance, fostering a reflective trial-and-error approach for navigating and identifying relevant segments. Our evaluation across multiple-choice questions (MCQ) and multi-document question answering (Multi-doc QA) benchmarks showcases QRMeM enhanced performance compared to existing approaches.
- [118] arXiv:2406.13170 [pdf, html, other]
-
Title: Amphista: Accelerate LLM Inference with Bi-directional Multiple Drafting Heads in a Non-autoregressive StyleZeping Li, Xinlong Yang, Ziheng Gao, Ji Liu, Zhuang Liu, Dong Li, Jinzhang Peng, Lu Tian, Emad BarsoumSubjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Large Language Models (LLMs) inherently use autoregressive decoding, which lacks parallelism in inference and results in significantly slow inference speeds, especially when hardware parallel accelerators and memory bandwidth are not fully utilized. In this work, we propose Amphista, a speculative decoding algorithm that adheres to a non-autoregressive decoding paradigm. Owing to the increased parallelism, our method demonstrates higher efficiency in inference compared to autoregressive methods. Specifically, Amphista models an Auto-embedding Block capable of parallel inference, incorporating bi-directional attention to enable interaction between different drafting heads. Additionally, Amphista implements Staged Adaptation Layers to facilitate the transition of semantic information from the base model's autoregressive inference to the drafting heads' non-autoregressive speculation, thereby achieving paradigm transformation and feature fusion. We conduct a series of experiments on a suite of Vicuna models using MT-Bench and Spec-Bench. For the Vicuna 33B model, Amphista achieves up to 2.75$\times$ and 1.40$\times$ wall-clock acceleration compared to vanilla autoregressive decoding and Medusa, respectively, while preserving lossless generation quality.
- [119] arXiv:2406.13173 [pdf, html, other]
-
Title: Biomedical Visual Instruction Tuning with Clinician Preference AlignmentSubjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Recent advancements in multimodal foundation models have showcased impressive capabilities in understanding and reasoning with visual and textual information. Adapting these foundation models trained for general usage to specialized domains like biomedicine requires large-scale domain-specific instruction datasets. While existing works have explored curating such datasets automatically, the resultant datasets are not explicitly aligned with domain expertise. In this work, we propose a data-centric framework, Biomedical Visual Instruction Tuning with Clinician Preference Alignment (BioMed-VITAL), that incorporates clinician preferences into both stages of generating and selecting instruction data for tuning biomedical multimodal foundation models. First, during the generation stage, we prompt the GPT-4V generator with a diverse set of clinician-selected demonstrations for preference-aligned data candidate generation. Then, during the selection phase, we train a separate selection model, which explicitly distills clinician and policy-guided model preferences into a rating function to select high-quality data for medical instruction tuning. Results show that the model tuned with the instruction-following data from our method demonstrates a significant improvement in open visual chat (18.5% relatively) and medical VQA (win rate up to 81.73%). Our instruction-following data and models are available at this http URL.
- [120] arXiv:2406.13175 [pdf, html, other]
-
Title: Sparse High Rank AdaptersKartikeya Bhardwaj, Nilesh Prasad Pandey, Sweta Priyadarshi, Viswanath Ganapathy, Rafael Esteves, Shreya Kadambi, Shubhankar Borse, Paul Whatmough, Risheek Garrepalli, Mart Van Baalen, Harris Teague, Markus NagelSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Low Rank Adaptation (LoRA) has gained massive attention in the recent generative AI research. One of the main advantages of LoRA is its ability to be fused with pretrained models adding no overhead during inference. However, from a mobile deployment standpoint, we can either avoid inference overhead in the fused mode but lose the ability to switch adapters rapidly, or suffer significant (up to 30% higher) inference latency while enabling rapid switching in the unfused mode. LoRA also exhibits concept-loss when multiple adapters are used concurrently. In this paper, we propose Sparse High Rank Adapters (SHiRA), a new paradigm which incurs no inference overhead, enables rapid switching, and significantly reduces concept-loss. Specifically, SHiRA can be trained by directly tuning only 1-2% of the base model weights while leaving others unchanged. This results in a highly sparse adapter which can be switched directly in the fused mode. We further provide theoretical and empirical insights on how high sparsity in SHiRA can aid multi-adapter fusion by reducing concept loss. Our extensive experiments on LVMs and LLMs demonstrate that finetuning only a small fraction of the parameters in the base model is sufficient for many tasks while enabling both rapid switching and multi-adapter fusion. Finally, we provide a latency- and memory-efficient SHiRA implementation based on Parameter-Efficient Finetuning (PEFT) Library. This implementation trains at nearly the same speed as LoRA while consuming lower peak GPU memory, thus making SHiRA easy to adopt for practical use cases.
- [121] arXiv:2406.13177 [pdf, html, other]
-
Title: Transferable Watermarking to Self-supervised Pre-trained Graph Encoders by Trigger EmbeddingsComments: this https URLSubjects: Cryptography and Security (cs.CR)
Recent years have witnessed the prosperous development of Graph Self-supervised Learning (GSSL), which enables to pre-train transferable foundation graph encoders. However, the easy-to-plug-in nature of such encoders makes them vulnerable to copyright infringement. To address this issue, we develop a novel watermarking framework to protect graph encoders in GSSL settings. The key idea is to force the encoder to map a set of specially crafted trigger instances into a unique compact cluster in the outputted embedding space during model pre-training. Consequently, when the encoder is stolen and concatenated with any downstream classifiers, the resulting model inherits the backdoor of the encoder and predicts the trigger instances to be in a single category with high probability regardless of the ground truth. Experimental results have shown that, the embedded watermark can be transferred to various downstream tasks in black-box settings, including node classification, link prediction and community detection, which forms a reliable watermark verification system for GSSL in reality. This approach also shows satisfactory performance in terms of model fidelity, reliability and robustness.
- [122] arXiv:2406.13179 [pdf, html, other]
-
Title: Global-Local Convolution with Spiking Neural Networks for Energy-efficient Keyword SpottingSubjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE); Audio and Speech Processing (eess.AS)
Thanks to Deep Neural Networks (DNNs), the accuracy of Keyword Spotting (KWS) has made substantial progress. However, as KWS systems are usually implemented on edge devices, energy efficiency becomes a critical requirement besides performance. Here, we take advantage of spiking neural networks' energy efficiency and propose an end-to-end lightweight KWS model. The model consists of two innovative modules: 1) Global-Local Spiking Convolution (GLSC) module and 2) Bottleneck-PLIF module. Compared to the hand-crafted feature extraction methods, the GLSC module achieves speech feature extraction that is sparser, more energy-efficient, and yields better performance. The Bottleneck-PLIF module further processes the signals from GLSC with the aim to achieve higher accuracy with fewer parameters. Extensive experiments are conducted on the Google Speech Commands Dataset (V1 and V2). The results show our method achieves competitive performance among SNN-based KWS models with fewer parameters.
- [123] arXiv:2406.13180 [pdf, html, other]
-
Title: Towards Trustworthy Unsupervised Domain Adaptation: A Representation Learning Perspective for Enhancing Robustness, Discrimination, and GeneralizationSubjects: Computer Vision and Pattern Recognition (cs.CV)
Robust Unsupervised Domain Adaptation (RoUDA) aims to achieve not only clean but also robust cross-domain knowledge transfer from a labeled source domain to an unlabeled target domain. A number of works have been conducted by directly injecting adversarial training (AT) in UDA based on the self-training pipeline and then aiming to generate better adversarial examples (AEs) for AT. Despite the remarkable progress, these methods only focus on finding stronger AEs but neglect how to better learn from these AEs, thus leading to unsatisfied results. In this paper, we investigate robust UDA from a representation learning perspective and design a novel algorithm by utilizing the mutual information theory, dubbed MIRoUDA. Specifically, through mutual information optimization, MIRoUDA is designed to achieve three characteristics that are highly expected in robust UDA, i.e., robustness, discrimination, and generalization. We then propose a dual-model framework accordingly for robust UDA learning. Extensive experiments on various benchmarks verify the effectiveness of the proposed MIRoUDA, in which our method surpasses the state-of-the-arts by a large margin.
- [124] arXiv:2406.13181 [pdf, html, other]
-
Title: The Impact of Auxiliary Patient Data on Automated Chest X-Ray Report Generation and How to Incorporate ItSubjects: Computer Vision and Pattern Recognition (cs.CV)
This study investigates the integration of diverse patient data sources into multimodal language models for automated chest X-ray (CXR) report generation. Traditionally, CXR report generation relies solely on CXR images and limited radiology data, overlooking valuable information from patient health records, particularly from emergency departments. Utilising the MIMIC-CXR and MIMIC-IV-ED datasets, we incorporate detailed patient information such as aperiodic vital signs, medications, and clinical history to enhance diagnostic accuracy. We introduce a novel approach to transform these heterogeneous data sources into embeddings that prompt a multimodal language model, significantly enhancing the diagnostic accuracy of generated radiology reports. Our comprehensive evaluation demonstrates the benefits of using a broader set of patient data, underscoring the potential for enhanced diagnostic capabilities and better patient outcomes through the integration of multimodal data in CXR report generation.
- [125] arXiv:2406.13183 [pdf, html, other]
-
Title: Communication-Efficient and Privacy-Preserving Decentralized Meta-LearningSubjects: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Distributed, Parallel, and Cluster Computing (cs.DC)
Distributed learning, which does not require gathering training data in a central location, has become increasingly important in the big-data era. In particular, random-walk-based decentralized algorithms are flexible in that they do not need a central server trusted by all clients and do not require all clients to be active in all iterations. However, existing distributed learning algorithms assume that all learning clients share the same task. In this paper, we consider the more difficult meta-learning setting, in which different clients perform different (but related) tasks with limited training data. To reduce communication cost and allow better privacy protection, we propose LoDMeta (Local Decentralized Meta-learning) with the use of local auxiliary optimization parameters and random perturbations on the model parameter. Theoretical results are provided on both convergence and privacy analysis. Empirical results on a number of few-shot learning data sets demonstrate that LoDMeta has similar meta-learning accuracy as centralized meta-learning algorithms, but does not require gathering data from each client and is able to better protect data privacy for each client.
- [126] arXiv:2406.13184 [pdf, html, other]
-
Title: Locating and Extracting Relational Concepts in Large Language ModelsJournal-ref: Findings of ACL2024Subjects: Computation and Language (cs.CL)
Relational concepts are indeed foundational to the structure of knowledge representation, as they facilitate the association between various entity concepts, allowing us to express and comprehend complex world knowledge. By expressing relational concepts in natural language prompts, people can effortlessly interact with large language models (LLMs) and recall desired factual knowledge. However, the process of knowledge recall lacks interpretability, and representations of relational concepts within LLMs remain unknown to us. In this paper, we identify hidden states that can express entity and relational concepts through causal mediation analysis in fact recall processes. Our finding reveals that at the last token position of the input prompt, there are hidden states that solely express the causal effects of relational concepts. Based on this finding, we assume that these hidden states can be treated as relational representations and we can successfully extract them from LLMs. The experimental results demonstrate high credibility of the relational representations: they can be flexibly transplanted into other fact recall processes, and can also be used as robust entity connectors. Moreover, we also show that the relational representations exhibit significant potential for controllable fact recall through relation rewriting.
- [127] arXiv:2406.13185 [pdf, html, other]
-
Title: Learnable In-Context Vector for Visual Question AnsweringSubjects: Computation and Language (cs.CL)
As language models continue to scale, Large Language Models (LLMs) have exhibited emerging capabilities in In-Context Learning (ICL), enabling them to solve language tasks by prefixing a few in-context demonstrations (ICDs) as context. Inspired by these advancements, researchers have extended these techniques to develop Large Multimodal Models (LMMs) with ICL capabilities. However, applying ICL usually faces two major challenges: 1) using more ICDs will largely increase the inference time and 2) the performance is sensitive to the selection of ICDs. These challenges are further exacerbated in LMMs due to the integration of multiple data types and the combinational complexity of multimodal ICDs. Recently, to address these challenges, some NLP studies introduce non-learnable In-Context Vectors (ICVs) which extract useful task information from ICDs into a single vector and then insert it into the LLM to help solve the corresponding task. However, although useful in simple NLP tasks, these non-learnable methods fail to handle complex multimodal tasks like Visual Question Answering (VQA). In this study, we propose \textbf{Learnable ICV} (L-ICV) to distill essential task information from demonstrations, improving ICL performance in LMMs. Experiments show that L-ICV can significantly reduce computational costs while enhancing accuracy in VQA tasks compared to traditional ICL and other non-learnable ICV methods.
- [128] arXiv:2406.13186 [pdf, html, other]
-
Title: A Federated Learning Approach for Multi-stage Threat Analysis in Advanced Persistent Threat CampaignsSubjects: Cryptography and Security (cs.CR)
Multi-stage threats like advanced persistent threats (APT) pose severe risks by stealing data and destroying infrastructure, with detection being challenging. APTs use novel attack vectors and evade signature-based detection by obfuscating their network presence, often going unnoticed due to their novelty. Although machine learning models offer high accuracy, they still struggle to identify true APT behavior, overwhelming analysts with excessive data. Effective detection requires training on multiple datasets from various clients, which introduces privacy issues under regulations like GDPR. To address these challenges, this paper proposes a novel 3-phase unsupervised federated learning (FL) framework to detect APTs. It identifies unique log event types, extracts suspicious patterns from related log events, and orders them by complexity and frequency. The framework ensures privacy through a federated approach and enhances security using Paillier's partial homomorphic encryption. Tested on the SoTM 34 dataset, our framework compares favorably against traditional methods, demonstrating efficient pattern extraction and analysis from log files, reducing analyst workload, and maintaining stringent data privacy. This approach addresses significant gaps in current methodologies, offering a robust solution to APT detection in compliance with privacy laws.
- [129] arXiv:2406.13187 [pdf, html, other]
-
Title: Boosting Consistency in Dual Training for Long-Tailed Semi-Supervised LearningSubjects: Machine Learning (cs.LG)
While long-tailed semi-supervised learning (LTSSL) has received tremendous attention in many real-world classification problems, existing LTSSL algorithms typically assume that the class distributions of labeled and unlabeled data are almost identical. Those LTSSL algorithms built upon the assumption can severely suffer when the class distributions of labeled and unlabeled data are mismatched since they utilize biased pseudo-labels from the model. To alleviate this problem, we propose a new simple method that can effectively utilize unlabeled data from unknown class distributions through Boosting cOnsistency in duAl Training (BOAT). Specifically, we construct the standard and balanced branch to ensure the performance of the head and tail classes, respectively. Throughout the training process, the two branches incrementally converge and interact with each other, eventually resulting in commendable performance across all classes. Despite its simplicity, we show that BOAT achieves state-of-the-art performance on a variety of standard LTSSL benchmarks, e.g., an averaged 2.7% absolute increase in test accuracy against existing algorithms when the class distributions of labeled and unlabeled data are mismatched. Even when the class distributions are identical, BOAT consistently outperforms many sophisticated LTSSL algorithms. We carry out extensive ablation studies to tease apart the factors that are the most important to the success of BOAT. The source code is available at this https URL.
- [130] arXiv:2406.13188 [pdf, html, other]
-
Title: Synthetic Context Generation for Question GenerationSubjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Despite rapid advancements in large language models (LLMs), QG remains a challenging problem due to its complicated process, open-ended nature, and the diverse settings in which question generation occurs. A common approach to address these challenges involves fine-tuning smaller, custom models using datasets containing background context, question, and answer. However, obtaining suitable domain-specific datasets with appropriate context is often more difficult than acquiring question-answer pairs. In this paper, we investigate training QG models using synthetic contexts generated by LLMs from readily available question-answer pairs. We conduct a comprehensive study to answer critical research questions related to the performance of models trained on synthetic contexts and their potential impact on QG research and applications. Our empirical results reveal: 1) contexts are essential for QG tasks, even if they are synthetic; 2) fine-tuning smaller language models has the capability of achieving better performances as compared to prompting larger language models; and 3) synthetic context and real context could achieve comparable performances. These findings highlight the effectiveness of synthetic contexts in QG and paves the way for future advancements in the field.
- [131] arXiv:2406.13191 [pdf, html, other]
-
Title: GPU-Accelerated DCOPF using Gradient-Based OptimizationSubjects: Systems and Control (eess.SY)
DC Optimal Power Flow (DCOPF) is a key operational tool for power system operators, and it is embedded as a subproblem in many challenging optimization problems (e.g., line switching). However, traditional CPU-based solve routines (e.g., simplex) have saturated in speed and are hard to parallelize. This paper focuses on solving DCOPF problems using gradient-based routines on Graphics Processing Units (GPUs), which have massive parallelization capability. To formulate these problems, we pose a Lagrange dual associated with DCOPF (linear and quadratic cost curves), and then we explicitly solve the inner (primal) minimization problem with a dual norm. The resulting dual problem can be efficiently iterated using projected gradient ascent. After solving the dual problem on both CPUs and GPUs to find tight lower bounds, we benchmark against Gurobi and MOSEK, comparing convergence speed and tightness on the IEEE 2000 and 10000 bus systems. We provide reliable and tight lower bounds for these problems with, at best, 5.4x speedup over a conventional solver.
- [132] arXiv:2406.13192 [pdf, html, other]
-
Title: Recovery of rational functions via Hankel pencil method and sensitivities of the polesSubjects: Numerical Analysis (math.NA)
In the paper, we develop a new method for the recovery of rational functions. Our idea is based on the property that Fourier coefficients of rational functions have the exponential structure and reconstruction of this exponential structure with the ESPRIT method in the frequency domain. Further we present sensitivity analysis for poles of rational functions reconstructed with our method in case of unstructured and structured perturbations. Finally, we consider several numerical experiments and, using sensitivities, explain the recovery errors for poles.
- [133] arXiv:2406.13193 [pdf, html, other]
-
Title: PRESTO: Progressive Pretraining Enhances Synthetic Chemistry OutcomesSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Chemical Physics (physics.chem-ph)
Multimodal Large Language Models (MLLMs) have seen growing adoption across various scientific disciplines. These advancements encourage the investigation of molecule-text modeling within synthetic chemistry, a field dedicated to designing and conducting chemical reactions to synthesize new compounds with desired properties and applications. Current approaches, however, often neglect the critical role of multiple molecule graph interaction in understanding chemical reactions, leading to suboptimal performance in synthetic chemistry tasks. This study introduces PRESTO(Progressive Pretraining Enhances Synthetic Chemistry Outcomes), a new framework that bridges the molecule-text modality gap by integrating a comprehensive benchmark of pretraining strategies and dataset configurations. It progressively improves multimodal LLMs through cross-modal alignment and multi-graph understanding. Our extensive experiments demonstrate that PRESTO offers competitive results in downstream synthetic chemistry tasks. The code can be found at this https URL.
- [134] arXiv:2406.13195 [pdf, html, other]
-
Title: Learning Translations via Matrix CompletionDerry Wijaya, Brendan Callahan, John Hewitt, Jie Gao, Xiao Ling, Marianna Apidianaki, Chris Callison-BurchComments: This is a late posting of an old paper as Google Scholar somehow misses indexing the ACL anthology version of the paperJournal-ref: Volume: Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, Year: 2017, Pages: 1452-1463Subjects: Computation and Language (cs.CL)
Bilingual Lexicon Induction is the task of learning word translations without bilingual parallel corpora. We model this task as a matrix completion problem, and present an effective and extendable framework for completing the matrix. This method harnesses diverse bilingual and monolingual signals, each of which may be incomplete or noisy. Our model achieves state-of-the-art performance for both high and low resource languages.
- [135] arXiv:2406.13200 [pdf, html, other]
-
Title: RobGC: Towards Robust Graph CondensationSubjects: Machine Learning (cs.LG)
Graph neural networks (GNNs) have attracted widespread attention for their impressive capability of graph representation learning. However, the increasing prevalence of large-scale graphs presents a significant challenge for GNN training due to their computational demands, limiting the applicability of GNNs in various scenarios. In response to this challenge, graph condensation (GC) is proposed as a promising acceleration solution, focusing on generating an informative compact graph that enables efficient training of GNNs while retaining performance. Despite the potential to accelerate GNN training, existing GC methods overlook the quality of large training graphs during both the training and inference stages. They indiscriminately emulate the training graph distributions, making the condensed graphs susceptible to noises within the training graph and significantly impeding the application of GC in intricate real-world scenarios. To address this issue, we propose robust graph condensation (RobGC), a plug-and-play approach for GC to extend the robustness and applicability of condensed graphs in noisy graph structure environments. Specifically, RobGC leverages the condensed graph as a feedback signal to guide the denoising process on the original training graph. A label propagation-based alternating optimization strategy is in place for the condensation and denoising processes, contributing to the mutual purification of the condensed graph and training graph. Additionally, as a GC method designed for inductive graph inference, RobGC facilitates test-time graph denoising by leveraging the noise-free condensed graph to calibrate the structure of the test graph. Extensive experiments show that RobGC is compatible with various GC methods, significantly boosting their robustness under different types and levels of graph structural noises.
- [136] arXiv:2406.13201 [pdf, html, other]
-
Title: Toward Structure Fairness in Dynamic Graph Embedding: A Trend-aware Dual Debiasing ApproachSubjects: Machine Learning (cs.LG); Social and Information Networks (cs.SI)
Recent studies successfully learned static graph embeddings that are structurally fair by preventing the effectiveness disparity of high- and low-degree vertex groups in downstream graph mining tasks. However, achieving structure fairness in dynamic graph embedding remains an open problem. Neglecting degree changes in dynamic graphs will significantly impair embedding effectiveness without notably improving structure fairness. This is because the embedding performance of high-degree and low-to-high-degree vertices will significantly drop close to the generally poorer embedding performance of most slightly changed vertices in the long-tail part of the power-law distribution. We first identify biased structural evolutions in a dynamic graph based on the evolving trend of vertex degree and then propose FairDGE, the first structurally Fair Dynamic Graph Embedding algorithm. FairDGE learns biased structural evolutions by jointly embedding the connection changes among vertices and the long-short-term evolutionary trend of vertex degrees. Furthermore, a novel dual debiasing approach is devised to encode fair embeddings contrastively, customizing debiasing strategies for different biased structural evolutions. This innovative debiasing strategy breaks the effectiveness bottleneck of embeddings without notable fairness loss. Extensive experiments demonstrate that FairDGE achieves simultaneous improvement in the effectiveness and fairness of embeddings.
- [137] arXiv:2406.13208 [pdf, html, other]
-
Title: Block-level Text Spotting with LLMsComments: 19 pages, 7 figuresSubjects: Computer Vision and Pattern Recognition (cs.CV)
Text spotting has seen tremendous progress in recent years yielding performant techniques which can extract text at the character, word or line level. However, extracting blocks of text from images (block-level text spotting) is relatively unexplored. Blocks contain more context than individual lines, words or characters and so block-level text spotting would enhance downstream applications, such as translation, which benefit from added context. We propose a novel method, BTS-LLM (Block-level Text Spotting with LLMs), to identify text at the block level. BTS-LLM has three parts: 1) detecting and recognizing text at the line level, 2) grouping lines into blocks and 3) finding the best order of lines within a block using a large language model (LLM). We aim to exploit the strong semantic knowledge in LLMs for accurate block-level text spotting. Consequently if the text spotted is semantically meaningful but has been corrupted during text recognition, the LLM is also able to rectify mistakes in the text and produce a reconstruction of it.
- [138] arXiv:2406.13210 [pdf, html, other]
-
Title: Surgical Triplet Recognition via Diffusion ModelSubjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Surgical triplet recognition is an essential building block to enable next-generation context-aware operating rooms. The goal is to identify the combinations of instruments, verbs, and targets presented in surgical video frames. In this paper, we propose DiffTriplet, a new generative framework for surgical triplet recognition employing the diffusion model, which predicts surgical triplets via iterative denoising. To handle the challenge of triplet association, two unique designs are proposed in our diffusion framework, i.e., association learning and association guidance. During training, we optimize the model in the joint space of triplets and individual components to capture the dependencies among them. At inference, we integrate association constraints into each update of the iterative denoising process, which refines the triplet prediction using the information of individual components. Experiments on the CholecT45 and CholecT50 datasets show the superiority of the proposed method in achieving a new state-of-the-art performance for surgical triplet recognition. Our codes will be released.
- [139] arXiv:2406.13213 [pdf, html, other]
-
Title: Multi-Meta-RAG: Improving RAG for Multi-Hop Queries using Database Filtering with LLM-Extracted MetadataComments: Submitted to ICTERI 2024 Posters TrackSubjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Databases (cs.DB)
The retrieval-augmented generation (RAG) enables retrieval of relevant information from an external knowledge source and allows large language models (LLMs) to answer queries over previously unseen document collections. However, it was demonstrated that traditional RAG applications perform poorly in answering multi-hop questions, which require retrieving and reasoning over multiple elements of supporting evidence. We introduce a new method called Multi-Meta-RAG, which uses database filtering with LLM-extracted metadata to improve the RAG selection of the relevant documents from various sources, relevant to the question. While database filtering is specific to a set of questions from a particular domain and format, we found out that Multi-Meta-RAG greatly improves the results on the MultiHop-RAG benchmark. The code is available at this https URL.
- [140] arXiv:2406.13214 [pdf, html, other]
-
Title: Self-Explainable Temporal Graph Networks based on Graph Information BottleneckComments: KDD 2024Subjects: Machine Learning (cs.LG)
Temporal Graph Neural Networks (TGNN) have the ability to capture both the graph topology and dynamic dependencies of interactions within a graph over time. There has been a growing need to explain the predictions of TGNN models due to the difficulty in identifying how past events influence their predictions. Since the explanation model for a static graph cannot be readily applied to temporal graphs due to its inability to capture temporal dependencies, recent studies proposed explanation models for temporal graphs. However, existing explanation models for temporal graphs rely on post-hoc explanations, requiring separate models for prediction and explanation, which is limited in two aspects: efficiency and accuracy of explanation. In this work, we propose a novel built-in explanation framework for temporal graphs, called Self-Explainable Temporal Graph Networks based on Graph Information Bottleneck (TGIB). TGIB provides explanations for event occurrences by introducing stochasticity in each temporal event based on the Information Bottleneck theory. Experimental results demonstrate the superiority of TGIB in terms of both the link prediction performance and explainability compared to state-of-the-art methods. This is the first work that simultaneously performs prediction and explanation for temporal graphs in an end-to-end manner.
- [141] arXiv:2406.13215 [pdf, html, other]
-
Title: Neural Residual Diffusion Models for Deep Scalable Vision GenerationSubjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
The most advanced diffusion models have recently adopted increasingly deep stacked networks (e.g., U-Net or Transformer) to promote the generative emergence capabilities of vision generation models similar to large language models (LLMs). However, progressively deeper stacked networks will intuitively cause numerical propagation errors and reduce noisy prediction capabilities on generative data, which hinders massively deep scalable training of vision generation models. In this paper, we first uncover the nature that neural networks being able to effectively perform generative denoising lies in the fact that the intrinsic residual unit has consistent dynamic property with the input signal's reverse diffusion process, thus supporting excellent generative abilities. Afterwards, we stand on the shoulders of two common types of deep stacked networks to propose a unified and massively scalable Neural Residual Diffusion Models framework (Neural-RDM for short), which is a simple yet meaningful change to the common architecture of deep generative networks by introducing a series of learnable gated residual parameters that conform to the generative dynamics. Experimental results on various generative tasks show that the proposed neural residual models obtain state-of-the-art scores on image's and video's generative benchmarks. Rigorous theoretical proofs and extensive experiments also demonstrate the advantages of this simple gated residual mechanism consistent with dynamic modeling in improving the fidelity and consistency of generated content and supporting large-scale scalable training. Code is available at this https URL.
- [142] arXiv:2406.13216 [pdf, html, other]
-
Title: Combining Optimal Transport and Embedding-Based Approaches for More Expressiveness in Unsupervised Graph AlignmentComments: 12 pages,9 figuresSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Unsupervised graph alignment finds the one-to-one node correspondence between a pair of attributed graphs by only exploiting graph structure and node features. One category of existing works first computes the node representation and then matches nodes with close embeddings, which is intuitive but lacks a clear objective tailored for graph alignment in the unsupervised setting. The other category reduces the problem to optimal transport (OT) via Gromov-Wasserstein (GW) learning with a well-defined objective but leaves a large room for exploring the design of transport cost. We propose a principled approach to combine their advantages motivated by theoretical analysis of model expressiveness. By noticing the limitation of discriminative power in separating matched and unmatched node pairs, we improve the cost design of GW learning with feature transformation, which enables feature interaction across dimensions. Besides, we propose a simple yet effective embedding-based heuristic inspired by the Weisfeiler-Lehman test and add its prior knowledge to OT for more expressiveness when handling non-Euclidean data. Moreover, we are the first to guarantee the one-to-one matching constraint by reducing the problem to maximum weight matching. The algorithm design effectively combines our OT and embedding-based predictions via stacking, an ensemble learning strategy. We propose a model framework named \texttt{CombAlign} integrating all the above modules to refine node alignment progressively. Through extensive experiments, we demonstrate significant improvements in alignment accuracy compared to state-of-the-art approaches and validate the effectiveness of the proposed modules.
- [143] arXiv:2406.13217 [pdf, html, other]
-
Title: Bridging Law and Data: Augmenting Reasoning via a Semi-Structured Dataset with IRAC methodologySubjects: Computation and Language (cs.CL)
The effectiveness of Large Language Models (LLMs) in legal reasoning is often limited due to the unique legal terminologies and the necessity for highly specialized knowledge. These limitations highlight the need for high-quality data tailored for complex legal reasoning tasks. This paper introduces LEGALSEMI, a benchmark specifically curated for legal scenario analysis. LEGALSEMI comprises 54 legal scenarios, each rigorously annotated by legal experts, based on the comprehensive IRAC (Issue, Rule, Application, Conclusion) framework. In addition, LEGALSEMI is accompanied by a structured knowledge graph (SKG). A series of experiments were conducted to assess the usefulness of LEGALSEMI for IRAC analysis. The experimental results demonstrate the effectiveness of incorporating the SKG for issue identification, rule retrieval, application and conclusion generation using four different LLMs. LEGALSEMI will be publicly available upon acceptance of this paper.
- [144] arXiv:2406.13219 [pdf, html, other]
-
Title: MC-MKE: A Fine-Grained Multimodal Knowledge Editing Benchmark Emphasizing Modality ConsistencySubjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Multimodal large language models (MLLMs) are prone to non-factual or outdated knowledge issues, which can manifest as misreading and misrecognition errors due to the complexity of multimodal knowledge. Previous benchmarks have not systematically analyzed the performance of editing methods in correcting these two error types. To better represent and correct these errors, we decompose multimodal knowledge into its visual and textual components. Different error types correspond to different editing formats, which edits distinct part of the multimodal knowledge. We present MC-MKE, a fine-grained Multimodal Knowledge Editing benchmark emphasizing Modality Consistency. Our benchmark facilitates independent correction of misreading and misrecognition errors by editing the corresponding knowledge component. We evaluate three multimodal knowledge editing methods on MC-MKE, revealing their limitations, particularly in terms of modality consistency. Our work highlights the challenges posed by multimodal knowledge editing and motivates further research in developing effective techniques for this task.
- [145] arXiv:2406.13221 [pdf, html, other]
-
Title: Privacy-Preserving Logistic Regression Training on Large DatasetsSubjects: Cryptography and Security (cs.CR)
Privacy-preserving machine learning is one class of cryptographic methods that aim to analyze private and sensitive data while keeping privacy, such as homomorphic logistic regression training over large encrypted data. In this paper, we propose an efficient algorithm for logistic regression training on large encrypted data using Homomorphic Encryption (HE), which is the mini-batch version of recent methods using a faster gradient variant called $\texttt{quadratic gradient}$. It is claimed that $\texttt{quadratic gradient}$ can integrate curve information (Hessian matrix) into the gradient and therefore can effectively accelerate the first-order gradient (descent) algorithms. We also implement the full-batch version of their method when the encrypted dataset is so large that it has to be encrypted in the mini-batch manner. We compare our mini-batch algorithm with our full-batch implementation method on real financial data consisting of 422,108 samples with 200 freatures. %Our experiments show that Nesterov's accelerated gradient (NAG) Given the inefficiency of HEs, our results are inspiring and demonstrate that the logistic regression training on large encrypted dataset is of practical feasibility, marking a significant milestone in our understanding.
- [146] arXiv:2406.13223 [pdf, html, other]
-
Title: Act Better by Timing: A timing-Aware Reinforcement Learning for Autonomous DrivingSubjects: Robotics (cs.RO)
Coping with intensively interactive scenarios is one of the significant challenges in the development of autonomous driving. Reinforcement learning (RL) offers an ideal solution for such scenarios through its self-evolution mechanism via interaction with the environment. However, the lack of sufficient safety mechanisms in common RL leads to the fact that agent often find it difficult to interact well in highly dynamic environment and may collide in pursuit of short-term rewards. Much of the existing safe RL methods require environment modeling to generate reliable safety boundaries that constrain agent behavior. Nevertheless, acquiring such safety boundaries is not always feasible in dynamic environments. Inspired by the driver's behavior of acting when uncertainty is minimal, this study introduces the concept of action timing to replace explicit safety boundary modeling. We define "actor" as an agent to decide optimal action at each step. By imaging the actor take opportunity to act as a timing-dependent gradual process, the other agent called "timing taker" can evaluate the optimal action execution time, and relate the optimal timing to each action moment as a dynamic safety factor to constrain the actor's action. In the experiment involving a complex, unsignaled intersection interaction, this framework achieved superior safety performance compared to all benchmark models.
- [147] arXiv:2406.13225 [pdf, html, other]
-
Title: Communication-Efficient Federated Knowledge Graph Embedding with Entity-Wise Top-K SparsificationSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Federated Knowledge Graphs Embedding learning (FKGE) encounters challenges in communication efficiency stemming from the considerable size of parameters and extensive communication rounds. However, existing FKGE methods only focus on reducing communication rounds by conducting multiple rounds of local training in each communication round, and ignore reducing the size of parameters transmitted within each communication round. To tackle the problem, we first find that universal reduction in embedding precision across all entities during compression can significantly impede convergence speed, underscoring the importance of maintaining embedding precision. We then propose bidirectional communication-efficient FedS based on Entity-Wise Top-K Sparsification strategy. During upload, clients dynamically identify and upload only the Top-K entity embeddings with the greater changes to the server. During download, the server first performs personalized embedding aggregation for each client. It then identifies and transmits the Top-K aggregated embeddings to each client. Besides, an Intermittent Synchronization Mechanism is used by FedS to mitigate negative effect of embedding inconsistency among shared entities of clients caused by heterogeneity of Federated Knowledge Graph. Extensive experiments across three datasets showcase that FedS significantly enhances communication efficiency with negligible (even no) performance degradation.
- [148] arXiv:2406.13227 [pdf, html, other]
-
Title: Controllable and Gradual Facial Blemishes Retouching via Physics-Based ModellingChenhao Shuai, Rizhao Cai, Bandara Dissanayake, Amanda Newman, Dayan Guan, Dennis Sng, Ling Li, Alex KotComments: 7 pages, 6 figures. The paper has been accepted by the IEEE Conference on Multimedia Expo 2024Subjects: Computer Vision and Pattern Recognition (cs.CV)
Face retouching aims to remove facial blemishes, such as pigmentation and acne, and still retain fine-grain texture details. Nevertheless, existing methods just remove the blemishes but focus little on realism of the intermediate process, limiting their use more to beautifying facial images on social media rather than being effective tools for simulating changes in facial pigmentation and ance. Motivated by this limitation, we propose our Controllable and Gradual Face Retouching (CGFR). Our CGFR is based on physical modelling, adopting Sum-of-Gaussians to approximate skin subsurface scattering in a decomposed melanin and haemoglobin color space. Our CGFR offers a user-friendly control over the facial blemishes, achieving realistic and gradual blemishes retouching. Experimental results based on actual clinical data shows that CGFR can realistically simulate the blemishes' gradual recovering process.
- [149] arXiv:2406.13228 [pdf, html, other]
-
Title: AGSOA:Graph Neural Network Targeted Attack Based on Average Gradient and Structure OptimizationSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Graph Neural Networks(GNNs) are vulnerable to adversarial attack that cause performance degradation by adding small perturbations to the graph. Gradient-based attacks are one of the most commonly used methods and have achieved good performance in many attack scenarios. However, current gradient attacks face the problems of easy to fall into local optima and poor attack invisibility. Specifically, most gradient attacks use greedy strategies to generate perturbations, which tend to fall into local optima leading to underperformance of the attack. In addition, many attacks only consider the effectiveness of the attack and ignore the invisibility of the attack, making the attacks easily exposed leading to failure. To address the above problems, this paper proposes an attack on GNNs, called AGSOA, which consists of an average gradient calculation and a structre optimization module. In the average gradient calculation module, we compute the average of the gradient information over all moments to guide the attack to generate perturbed edges, which stabilizes the direction of the attack update and gets rid of undesirable local maxima. In the structure optimization module, we calculate the similarity and homogeneity of the target node's with other nodes to adjust the graph structure so as to improve the invisibility and transferability of the attack. Extensive experiments on three commonly used datasets show that AGSOA improves the misclassification rate by 2$\%$-8$\%$ compared to other state-of-the-art models.
- [150] arXiv:2406.13229 [pdf, html, other]
-
Title: Probing the Emergence of Cross-lingual Alignment during LLM TrainingComments: Accepted to Findings of the Association for Computational Linguistics: ACL 2024Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Multilingual Large Language Models (LLMs) achieve remarkable levels of zero-shot cross-lingual transfer performance. We speculate that this is predicated on their ability to align languages without explicit supervision from parallel sentences. While representations of translationally equivalent sentences in different languages are known to be similar after convergence, however, it remains unclear how such cross-lingual alignment emerges during pre-training of LLMs. Our study leverages intrinsic probing techniques, which identify which subsets of neurons encode linguistic features, to correlate the degree of cross-lingual neuron overlap with the zero-shot cross-lingual transfer performance for a given model. In particular, we rely on checkpoints of BLOOM, a multilingual autoregressive LLM, across different training steps and model scales. We observe a high correlation between neuron overlap and downstream performance, which supports our hypothesis on the conditions leading to effective cross-lingual transfer. Interestingly, we also detect a degradation of both implicit alignment and multilingual abilities in certain phases of the pre-training process, providing new insights into the multilingual pretraining dynamics.
- [151] arXiv:2406.13230 [pdf, html, other]
-
Title: Enhancing Language Model Factuality via Activation-Based Confidence Calibration and Guided DecodingSubjects: Computation and Language (cs.CL)
Calibrating language models (LMs) aligns their generation confidence with the actual likelihood of answer correctness, which can inform users about LMs' reliability and mitigate hallucinated content. However, prior calibration methods, such as self-consistency-based and logit-based approaches, are either limited in inference-time efficiency or fall short of providing informative signals. Moreover, simply filtering out low-confidence responses reduces the LM's helpfulness when the answers are correct. Therefore, effectively using calibration techniques to enhance an LM's factuality remains an unsolved challenge. In this paper, we first propose an activation-based calibration method, ActCab, which trains a linear layer on top of the LM's last-layer activations that can better capture the representations of knowledge. Built on top of ActCab, we further propose CoDec, a confidence-guided decoding strategy to elicit truthful answers with high confidence from LMs. By evaluating on five popular QA benchmarks, ActCab achieves superior calibration performance than all competitive baselines, e.g., by reducing the average expected calibration error (ECE) score by up to 39%. Further experiments on CoDec show consistent improvements in several LMs' factuality on challenging QA datasets, such as TruthfulQA, highlighting the value of confidence signals in enhancing factuality.
- [152] arXiv:2406.13231 [pdf, html, other]
-
Title: Tight Lower Bounds for Directed Cut Sparsification and Distributed Min-CutSubjects: Data Structures and Algorithms (cs.DS)
In this paper, we consider two fundamental cut approximation problems on large graphs. We prove new lower bounds for both problems that are optimal up to logarithmic factors.
The first problem is to approximate cuts in balanced directed graphs. In this problem, the goal is to build a data structure that $(1 \pm \epsilon)$-approximates cut values in graphs with $n$ vertices. For arbitrary directed graphs, such a data structure requires $\Omega(n^2)$ bits even for constant $\epsilon$. To circumvent this, recent works study $\beta$-balanced graphs, meaning that for every directed cut, the total weight of edges in one direction is at most $\beta$ times that in the other direction. We consider two models: the {\em for-each} model, where the goal is to approximate each cut with constant probability, and the {\em for-all} model, where all cuts must be preserved simultaneously. We improve the previous $\Omega(n \sqrt{\beta/\epsilon})$ lower bound to $\tilde{\Omega}(n \sqrt{\beta}/\epsilon)$ in the for-each model, and we improve the previous $\Omega(n \beta/\epsilon)$ lower bound to $\Omega(n \beta/\epsilon^2)$ in the for-all model. This resolves the main open questions of (Cen et al., ICALP, 2021).
The second problem is to approximate the global minimum cut in a local query model, where we can only access the graph via degree, edge, and adjacency queries. We improve the previous $\Omega\bigl(\frac{m}{k}\bigr)$ query complexity lower bound to $\Omega\bigl(\min\{m, \frac{m}{\epsilon^2 k}\}\bigr)$ for this problem, where $m$ is the number of edges, $k$ is the size of the minimum cut, and we seek a $(1+\epsilon)$-approximation. In addition, we show that existing upper bounds with slight modifications match our lower bound up to logarithmic factors. - [153] arXiv:2406.13232 [pdf, other]
-
Title: Towards Robust Evaluation: A Comprehensive Taxonomy of Datasets and Metrics for Open Domain Question Answering in the Era of Large Language ModelsComments: 22 pages, 13 tables, 7 figuresSubjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Open Domain Question Answering (ODQA) within natural language processing involves building systems that answer factual questions using large-scale knowledge corpora. Recent advances stem from the confluence of several factors, such as large-scale training datasets, deep learning techniques, and the rise of large language models. High-quality datasets are used to train models on realistic scenarios and enable the evaluation of the system on potentially unseen data. Standardized metrics facilitate comparisons between different ODQA systems, allowing researchers to objectively track advancements in the field. Our study presents a thorough examination of the current landscape of ODQA benchmarking by reviewing 52 datasets and 20 evaluation techniques across textual and multimodal modalities. We introduce a novel taxonomy for ODQA datasets that incorporates both the modality and difficulty of the question types. Additionally, we present a structured organization of ODQA evaluation metrics along with a critical analysis of their inherent trade-offs. Our study aims to empower researchers by providing a framework for the robust evaluation of modern question-answering systems. We conclude by identifying the current challenges and outlining promising avenues for future research and development.
- [154] arXiv:2406.13233 [pdf, html, other]
-
Title: AdaMoE: Token-Adaptive Routing with Null Experts for Mixture-of-Experts Language ModelsSubjects: Artificial Intelligence (cs.AI)
Mixture of experts (MoE) has become the standard for constructing production-level large language models (LLMs) due to its promise to boost model capacity without causing significant overheads. Nevertheless, existing MoE methods usually enforce a constant top-k routing for all tokens, which is arguably restrictive because various tokens (e.g., "<EOS>" vs. "apple") may require various numbers of experts for feature abstraction. Lifting such a constraint can help make the most of limited resources and unleash the potential of the model for downstream tasks. In this sense, we introduce AdaMoE to realize token-adaptive routing for MoE, where different tokens are permitted to select a various number of experts. AdaMoE makes minimal modifications to the vanilla MoE with top-k routing -- it simply introduces a fixed number of null experts, which do not consume any FLOPs, to the expert set and increases the value of k. AdaMoE does not force each token to occupy a fixed number of null experts but ensures the average usage of the null experts with a load-balancing loss, leading to an adaptive number of null/true experts used by each token. AdaMoE exhibits a strong resemblance to MoEs with expert choice routing while allowing for trivial auto-regressive modeling. AdaMoE is easy to implement and can be effectively applied to pre-trained (MoE-)LLMs. Extensive studies show that AdaMoE can reduce average expert load (FLOPs) while achieving superior performance. For example, on the ARC-C dataset, applying our method to fine-tuning Mixtral-8x7B can reduce FLOPs by 14.5% while increasing accuracy by 1.69%.
- [155] arXiv:2406.13235 [pdf, html, other]
-
Title: Enhancing Collaborative Semantics of Language Model-Driven Recommendations via Graph-Aware LearningComments: 10pagesSubjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Large Language Models (LLMs) are increasingly prominent in the recommendation systems domain. Existing studies usually utilize in-context learning or supervised fine-tuning on task-specific data to align LLMs into recommendations. However, the substantial bias in semantic spaces between language processing tasks and recommendation tasks poses a nonnegligible challenge. Specifically, without the adequate capturing ability of collaborative information, existing modeling paradigms struggle to capture behavior patterns within community groups, leading to LLMs' ineffectiveness in discerning implicit interaction semantic in recommendation scenarios. To address this, we consider enhancing the learning capability of language model-driven recommendation models for structured data, specifically by utilizing interaction graphs rich in collaborative semantics. We propose a Graph-Aware Learning for Language Model-Driven Recommendations (GAL-Rec). GAL-Rec enhances the understanding of user-item collaborative semantics by imitating the intent of Graph Neural Networks (GNNs) to aggregate multi-hop information, thereby fully exploiting the substantial learning capacity of LLMs to independently address the complex graphs in the recommendation system. Sufficient experimental results on three real-world datasets demonstrate that GAL-Rec significantly enhances the comprehension of collaborative semantics, and improves recommendation performance.
- [156] arXiv:2406.13236 [pdf, html, other]
-
Title: Data Contamination Can Cross Language BarriersComments: 12 pages, 5 figuresSubjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
The opacity in developing large language models (LLMs) is raising growing concerns about the potential contamination of public benchmarks in the pre-training data. Existing contamination detection methods are typically based on the text overlap between training and evaluation data, which can be too superficial to reflect deeper forms of contamination. In this paper, we first present a cross-lingual form of contamination that inflates LLMs' performance while evading current detection methods, deliberately injected by overfitting LLMs on the translated versions of benchmark test sets. Then, we propose generalization-based approaches to unmask such deeply concealed contamination. Specifically, we examine the LLM's performance change after modifying the original benchmark by replacing the false answer choices with correct ones from other questions. Contaminated models can hardly generalize to such easier situations, where the false choices can be \emph{not even wrong}, as all choices are correct in their memorization. Experimental results demonstrate that cross-lingual contamination can easily fool existing detection methods, but not ours. In addition, we discuss the potential utilization of cross-lingual contamination in interpreting LLMs' working mechanisms and in post-training LLMs for enhanced multilingual capabilities. The code and dataset we use can be obtained from \url{this https URL}.
- [157] arXiv:2406.13237 [pdf, html, other]
-
Title: ModelMix: A New Model-Mixup Strategy to Minimize Vicinal Risk across Tasks for Few-scribble based Cardiac SegmentationComments: 10 pages, 3 figuresSubjects: Computer Vision and Pattern Recognition (cs.CV)
Pixel-level dense labeling is both resource-intensive and time-consuming, whereas weak labels such as scribble present a more feasible alternative to full annotations. However, training segmentation networks with weak supervision from scribbles remains challenging. Inspired by the fact that different segmentation tasks can be correlated with each other, we introduce a new approach to few-scribble supervised segmentation based on model parameter interpolation, termed as ModelMix. Leveraging the prior knowledge that linearly interpolating convolution kernels and bias terms should result in linear interpolations of the corresponding feature vectors, ModelMix constructs virtual models using convex combinations of convolutional parameters from separate encoders. We then regularize the model set to minimize vicinal risk across tasks in both unsupervised and scribble-supervised way. Validated on three open datasets, i.e., ACDC, MSCMRseg, and MyoPS, our few-scribble guided ModelMix significantly surpasses the performance of the state-of-the-art scribble supervised methods.
- [158] arXiv:2406.13242 [pdf, html, other]
-
Title: MagicItem: Dynamic Behavior Design of Virtual Objects with Large Language Models in a Consumer Metaverse PlatformRyutaro Kurai, Takefumi Hiraki, Yuichi Hiroi, Yutaro Hirao, Monica Perusquia-Hernandez, Hideaki Uchiyama, Kiyoshi KiyokawaSubjects: Human-Computer Interaction (cs.HC)
To create rich experiences in virtual reality (VR) environments, it is essential to define the behavior of virtual objects through programming. However, programming in 3D spaces requires a wide range of background knowledge and programming skills. Although Large Language Models (LLMs) have provided programming support, they are still primarily aimed at programmers. In metaverse platforms, where many users inhabit VR spaces, most users are unfamiliar with programming, making it difficult for them to modify the behavior of objects in the VR environment easily. Existing LLM-based script generation methods for VR spaces require multiple lengthy iterations to implement the desired behaviors and are difficult to integrate into the operation of metaverse platforms. To address this issue, we propose a tool that generates behaviors for objects in VR spaces from natural language within Cluster, a metaverse platform with a large user base. By integrating LLMs with the Cluster Script provided by this platform, we enable users with limited programming experience to define object behaviors within the platform freely. We have also integrated our tool into a commercial metaverse platform and are conducting online experiments with 63 general users of the platform. The experiments show that even users with no programming background can successfully generate behaviors for objects in VR spaces, resulting in a highly satisfying system. Our research contributes to democratizing VR content creation by enabling non-programmers to design dynamic behaviors for virtual objects in metaverse platforms.
- [159] arXiv:2406.13243 [pdf, html, other]
-
Title: Abelian Group Codes for Classical and Classical-Quantum Channels: One-shot and Asymptotic Rate BoundsComments: 41 pagesSubjects: Information Theory (cs.IT)
We study the problem of transmission of information over classical and classical-quantum channels in the one-shot regime where the underlying codes are constrained to be group codes. In the achievability part, we introduce a new input probability distribution that incorporates the encoding homomorphism and the underlying channel law. Using a random coding argument, we characterize the performance of group codes in terms of hypothesis testing relative-entropic quantities. In the converse part, we establish bounds by leveraging a hypothesis testing-based approach. Furthermore, we apply the one-shot result to the asymptotic stationary memoryless setting, and establish a single-letter lower bound on group capacities for both classes of channels. Moreover, we derive a matching upper bound on the asymptotic group capacity.
- [160] arXiv:2406.13246 [pdf, html, other]
-
Title: GSR-BENCH: A Benchmark for Grounded Spatial Reasoning Evaluation via Multimodal LLMsSubjects: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
The ability to understand and reason about spatial relationships between objects in images is an important component of visual reasoning. This skill rests on the ability to recognize and localize objects of interest and determine their spatial relation. Early vision and language models (VLMs) have been shown to struggle to recognize spatial relations. We extend the previously released What'sUp dataset and propose a novel comprehensive evaluation for spatial relationship understanding that highlights the strengths and weaknesses of 27 different models. In addition to the VLMs evaluated in What'sUp, our extensive evaluation encompasses 3 classes of Multimodal LLMs (MLLMs) that vary in their parameter sizes (ranging from 7B to 110B), training/instruction-tuning methods, and visual resolution to benchmark their performances and scrutinize the scaling laws in this task.
- [161] arXiv:2406.13247 [pdf, html, other]
-
Title: A Match Made in Semantics: Physics-infused Digital Twins for Smart Building AutomationSubjects: Other Computer Science (cs.OH)
Buildings contain electro-mechanical systems that ensure the occupants' comfort, health, and safety. The functioning of these systems is automated through control programs, which are often available as reusable artifacts in a software library. However, matching these reusable control programs to the installed technical systems requires manual effort and adds engineering cost. In this article, we show that such matching can be accomplished fully automatically through logical rules and based on the creation of semantic relationships between descriptions of \emph{physical processes} and descriptions of technical systems and control programs. For this purpose, we propose a high-level bridging ontology that enables the desired rule-based matching and equips digital twins of the technical systems with the required knowledge about the underlying physical processes in a self-contained manner. We evaluated our approach in a real-life building automation project with a total of 34 deployed air handling units. Our data show that rules based on our bridging ontology enabled the system to infer the suitable choice of control programs automatically in more than 90\% of the cases while avoiding almost an hour of manual work for each such match.
- [162] arXiv:2406.13248 [pdf, html, other]
-
Title: Overlay Space-Air-Ground Integrated Networks with SWIPT-Empowered Aerial CommunicationsComments: 36 pages, 14 figures, This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessibleSubjects: Information Theory (cs.IT); Signal Processing (eess.SP)
In this article, we consider overlay space-air-ground integrated networks (OSAGINs) where a low earth orbit (LEO) satellite communicates with ground users (GUs) with the assistance of an energy-constrained coexisting air-to-air (A2A) network. Particularly, a non-linear energy harvester with a hybrid SWIPT utilizing both power-splitting and time-switching energy harvesting (EH) techniques is employed at the aerial transmitter. Specifically, we take the random locations of the satellite, ground and aerial receivers to investigate the outage performance of both the satellite-to-ground and aerial networks leveraging the stochastic tools. By taking into account the Shadowed-Rician fading for satellite link, the Nakagami-\emph{m} for ground link, and the Rician fading for aerial link, we derive analytical expressions for the outage probability of these networks. For a comprehensive analysis of aerial network, we consider both the perfect and imperfect successive interference cancellation (SIC) scenarios. Through our analysis, we illustrate that, unlike linear EH, the implementation of non-linear EH provides accurate figures for any target rate, underscoring the significance of using non-linear EH models. Additionally, the influence of key parameters is emphasized, providing guidelines for the practical design of an energy-efficient as well as spectrum-efficient future non-terrestrial networks. Monte Carlo simulations validate the accuracy of our theoretical developments.
- [163] arXiv:2406.13249 [pdf, html, other]
-
Title: R^2AG: Incorporating Retrieval Information into Retrieval Augmented GenerationSubjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Retrieval augmented generation (RAG) has been applied in many scenarios to augment large language models (LLMs) with external documents provided by retrievers. However, a semantic gap exists between LLMs and retrievers due to differences in their training objectives and architectures. This misalignment forces LLMs to passively accept the documents provided by the retrievers, leading to incomprehension in the generation process, where the LLMs are burdened with the task of distinguishing these documents using their inherent knowledge. This paper proposes R$^2$AG, a novel enhanced RAG framework to fill this gap by incorporating Retrieval information into Retrieval Augmented Generation. Specifically, R$^2$AG utilizes the nuanced features from the retrievers and employs a R$^2$-Former to capture retrieval information. Then, a retrieval-aware prompting strategy is designed to integrate retrieval information into LLMs' generation. Notably, R$^2$AG suits low-source scenarios where LLMs and retrievers are frozen. Extensive experiments across five datasets validate the effectiveness, robustness, and efficiency of R$^2$AG. Our analysis reveals that retrieval information serves as an anchor to aid LLMs in the generation process, thereby filling the semantic gap.
- [164] arXiv:2406.13250 [pdf, html, other]
-
Title: LangTopo: Aligning Language Descriptions of Graphs with Tokenized Topological ModelingSubjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Recently, large language models (LLMs) have been widely researched in the field of graph machine learning due to their outstanding abilities in language comprehension and learning. However, the significant gap between natural language tasks and topological structure modeling poses a nonnegligible challenge. Specifically, since natural language descriptions are not sufficient for LLMs to understand and process graph-structured data, fine-tuned LLMs perform even worse than some traditional GNN models on graph tasks, lacking inherent modeling capabilities for graph structures. Existing research overly emphasizes LLMs' understanding of semantic information captured by external models, while inadequately exploring graph topological structure modeling, thereby overlooking the genuine capabilities that LLMs lack. Consequently, in this paper, we introduce a new framework, LangTopo, which aligns graph structure modeling with natural language understanding at the token level. LangTopo quantifies the graph structure modeling capabilities of GNNs and LLMs by constructing a codebook for the graph modality and performs consistency maximization. This process aligns the text description of LLM with the topological modeling of GNN, allowing LLM to learn the ability of GNN to capture graph structures, enabling LLM to handle graph-structured data independently. We demonstrate the effectiveness of our proposed method on multiple datasets.
- [165] arXiv:2406.13251 [pdf, html, other]
-
Title: Freq-Mip-AA : Frequency Mip Representation for Anti-Aliasing Neural Radiance FieldsComments: Accepted to ICIP 2024, 7 pages, 3 figuresSubjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Image and Video Processing (eess.IV)
Neural Radiance Fields (NeRF) have shown remarkable success in representing 3D scenes and generating novel views. However, they often struggle with aliasing artifacts, especially when rendering images from different camera distances from the training views. To address the issue, Mip-NeRF proposed using volumetric frustums to render a pixel and suggested integrated positional encoding (IPE). While effective, this approach requires long training times due to its reliance on MLP architecture. In this work, we propose a novel anti-aliasing technique that utilizes grid-based representations, usually showing significantly faster training time. In addition, we exploit frequency-domain representation to handle the aliasing problem inspired by the sampling theorem. The proposed method, FreqMipAA, utilizes scale-specific low-pass filtering (LPF) and learnable frequency masks. Scale-specific low-pass filters (LPF) prevent aliasing and prioritize important image details, and learnable masks effectively remove problematic high-frequency elements while retaining essential information. By employing a scale-specific LPF and trainable masks, FreqMipAA can effectively eliminate the aliasing factor while retaining important details. We validated the proposed technique by incorporating it into a widely used grid-based method. The experimental results have shown that the FreqMipAA effectively resolved the aliasing issues and achieved state-of-the-art results in the multi-scale Blender dataset. Our code is available at this https URL .
- [166] arXiv:2406.13253 [pdf, html, other]
-
Title: Smart Contracts in the Real World: A Statistical Exploration of External Data DependenciesSubjects: Cryptography and Security (cs.CR)
Smart contracts are pivotal for implementing various functions due to their interactivity with external data. However, this interactivity also presents challenges in terms of security and reliability. There is a lack of statistical and quantitative research on the interaction between smart contracts and external data. To fill this gap, we thoroughly examine 10,500 actual smart contracts to select 9,356 valid samples, excluding those that are outdated or have compilation errors. Utilizing code parsing techniques, the study transformed contract code into Abstract Syntax Trees (ASTs) and extracted keywords related to external data dependency through code analysis. By comparing the ASTs with the keyword list, we conduct a quantitative analysis of the number and proportion of contracts involving external data interaction. Furthermore, we collect over 3,600 security audit reports and manually filter 249 (approximately 9%) reports related to external data interaction, categorizing the external data dependency in these contracts. We also explore the relationship between the complexity of smart contracts and their dependence on external data.
- [167] arXiv:2406.13256 [pdf, html, other]
-
Title: Winning Through Simplicity: Autonomous Car Design for Formula StudentTobias Friedrich, Marco Müller, Adrian Bauske, Simon Härtl, Johannes Herrmann, David Förster, Tobias Tietze, Sebastian SartorSubjects: Robotics (cs.RO)
This paper presents the design of an autonomous race car that is self-designed, self-developed, and self-built by the Elefant Racing team at the University of Bayreuth. The system is created to compete in the Formula Student Driverless competition. Its primary focus is on the Acceleration track, a straight 75-meter-long course, and the Skidpad track, which comprises two circles forming an eight. Additionally, it is experimentally capable of competing in the Autocross and Trackdrive events, which feature tracks with previously unknown straights and curves. The paper details the hardware, software and sensor setup employed during the 2020/2021 season. Despite being developed by a small team with limited computer science expertise, the design won the Formula Student East Engineering Design award. Emphasizing simplicity and efficiency, the team employed streamlined techniques to achieve their success.
- [168] arXiv:2406.13257 [pdf, html, other]
-
Title: Reasoning with trees: interpreting CNNs using hierarchiesSubjects: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Challenges persist in providing interpretable explanations for neural network reasoning in explainable AI (xAI). Existing methods like Integrated Gradients produce noisy maps, and LIME, while intuitive, may deviate from the model's reasoning. We introduce a framework that uses hierarchical segmentation techniques for faithful and interpretable explanations of Convolutional Neural Networks (CNNs). Our method constructs model-based hierarchical segmentations that maintain the model's reasoning fidelity and allows both human-centric and model-centric segmentation. This approach offers multiscale explanations, aiding bias identification and enhancing understanding of neural network decision-making. Experiments show that our framework, xAiTrees, delivers highly interpretable and faithful model explanations, not only surpassing traditional xAI methods but shedding new light on a novel approach to enhancing xAI interpretability. Code at: this https URL .
- [169] arXiv:2406.13258 [pdf, other]
-
Title: Applications of Post-quantum CryptographyComments: Proceedings of the 23rd European Conference on Cyber Warfare and Security (ECCWS)Subjects: Cryptography and Security (cs.CR); Emerging Technologies (cs.ET)
With the constantly advancing capabilities of quantum computers, conventional cryptographic systems relying on complex math problems may encounter unforeseen vulnerabilities. Unlike regular computers, which are often deemed cost-ineffective in cryptographic attacks, quantum computers have a significant advantage in calculation speed. This distinction potentially makes currently used algorithms less secure or even completely vulnerable, compelling the exploration of post-quantum cryptography (PQC) as the most reasonable solution to quantum threats. This review aims to provide current information on applications, benefits, and challenges associated with the PQC. The review employs a systematic scoping review with the scope restricted to the years 2022 and 2023; only articles that were published in scientific journals were used in this paper. The review examined the articles on the applications of quantum computing in various spheres. However, the scope of this paper was restricted to the domain of the PQC because most of the analyzed articles featured this field. Subsequently, the paper is analyzing various PQC algorithms, including lattice-based, hash-based, code-based, multivariate polynomial, and isogeny-based cryptography. Each algorithm is being judged based on its potential applications, robustness, and challenges. All the analyzed algorithms are promising for the post-quantum era in such applications as digital signatures, communication channels, and IoT. Moreover, some of the algorithms are already implemented in the spheres of banking transactions, communication, and intellectual property. Meanwhile, despite their potential, these algorithms face serious challenges since they lack standardization, require vast amounts of storage and computation power, and might have unknown vulnerabilities that can be discovered only with years of cryptanalysis.
- [170] arXiv:2406.13259 [pdf, other]
-
Title: Cyber Protection Applications of Quantum Computing: A ReviewComments: Proceedings of the 23rd European Conference on Cyber Warfare and Security (ECCWS)Subjects: Cryptography and Security (cs.CR); Emerging Technologies (cs.ET)
Quantum computing is a cutting-edge field of information technology that harnesses the principles of quantum mechanics to perform computations. It has major implications for the cyber security industry. Existing cyber protection applications are working well, but there are still challenges and vulnerabilities in computer networks. Sometimes data and privacy are also compromised. These complications lead to research questions asking what kind of cyber protection applications of quantum computing are there and what potential methods or techniques can be used for cyber protection? These questions will reveal how much power quantum computing has and to what extent it can outperform the conventional computing systems. This scoping review was conducted by considering 815 papers. It showed the possibilities that can be achievedif quantum technologies are implemented in cyber environments. This scoping review discusses various domains such as algorithms and applications, bioinformatics, cloud and edge computing, the organization of complex systems, application areas focused on security and threats, and the broader quantum computing ecosystem. In each of these areas, there is significant scope for quantum computing to be implemented and to revolutionize the working environment. Numerous quantum computing applications for cyber protection and a number of techniques to protect our data and privacy were identified. The results are not limited to network security but also include data security. This paper also discusses societal aspects, e.g., the applications of quantum computing in the social sciences. This scoping review discusses how to enhance the efficiency and security of quantum computing in various cyber security domains. Additionally, it encourages the reader to think about what kind of techniques and methods can be deployed to secure the cyber world.
- [171] arXiv:2406.13260 [pdf, other]
-
Title: Hoop Diagrams: A Set Visualization MethodSubjects: Graphics (cs.GR)
We introduce Hoop Diagrams, a new visualization technique for set data. Hoop Diagrams are a circular visualization with hoops representing sets and sectors representing set intersections. We present an interactive tool for drawing Hoop Diagrams and describe a user study comparing them with Linear Diagrams. The results show only small differences, with users answering questions more quickly with Linear Diagrams, but answering some questions more accurately with Hoop Diagrams. Interaction data indicates that those using set order and intersection highlighting were more successful at answering questions, but those who used other interactions had a slower response. The similarity in usability suggests that the diagram type should be chosen based on the presentation method. Linear Diagrams increase in the horizontal direction with the number of intersections, leading to difficulties fitting on a screen. Hoop Diagrams al-ways have a square aspect ratio.
- [172] arXiv:2406.13261 [pdf, html, other]
-
Title: BeHonest: Benchmarking Honesty of Large Language ModelsSubjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Previous works on Large Language Models (LLMs) have mainly focused on evaluating their helpfulness or harmlessness. However, honesty, another crucial alignment criterion, has received relatively less attention. Dishonest behaviors in LLMs, such as spreading misinformation and defrauding users, eroding user trust, and causing real-world harm, present severe risks that intensify as these models approach superintelligence levels. Enhancing honesty in LLMs addresses critical deficiencies and helps uncover latent capabilities that are not readily expressed. This underscores the urgent need for reliable methods and benchmarks to effectively ensure and evaluate the honesty of LLMs.
In this paper, we introduce BeHonest, a pioneering benchmark specifically designed to assess honesty in LLMs comprehensively. BeHonest evaluates three essential aspects of honesty: awareness of knowledge boundaries, avoidance of deceit, and consistency in responses. Building on this foundation, we designed 10 scenarios to evaluate and analyze 9 popular LLMs on the market, including both closed-source and open-source models from different model families with varied model sizes. Our findings indicate that there is still significant room for improvement in the honesty of LLMs. We also encourage the AI community to prioritize honesty alignment in LLMs. Our benchmark and code can be found at: \url{this https URL}. - [173] arXiv:2406.13262 [pdf, other]
-
Title: Machine Learning Applications of Quantum Computing: A ReviewComments: Proceedings of the 23rd European Conference on Cyber Warfare and Security (ECCWS)Subjects: Machine Learning (cs.LG); Emerging Technologies (cs.ET)
At the intersection of quantum computing and machine learning, this review paper explores the transformative impact these technologies are having on the capabilities of data processing and analysis, far surpassing the bounds of traditional computational methods. Drawing upon an in-depth analysis of 32 seminal papers, this review delves into the interplay between quantum computing and machine learning, focusing on transcending the limitations of classical computing in advanced data processing and applications. This review emphasizes the potential of quantum-enhanced methods in enhancing cybersecurity, a critical sector that stands to benefit significantly from these advancements. The literature review, primarily leveraging Science Direct as an academic database, delves into the transformative effects of quantum technologies on machine learning, drawing insights from a diverse collection of studies and scholarly articles. While the focus is primarily on the growing significance of quantum computing in cybersecurity, the review also acknowledges the promising implications for other sectors as the field matures. Our systematic approach categorizes sources based on quantum machine learning algorithms, applications, challenges, and potential future developments, uncovering that quantum computing is increasingly being implemented in practical machine learning scenarios. The review highlights advancements in quantum-enhanced machine learning algorithms and their potential applications in sectors such as cybersecurity, emphasizing the need for industry-specific solutions while considering ethical and security concerns. By presenting an overview of the current state and projecting future directions, the paper sets a foundation for ongoing research and strategic advancement in quantum machine learning.
- [174] arXiv:2406.13264 [pdf, html, other]
-
Title: Do Multimodal Foundation Models Understand Enterprise Workflows? A Benchmark for Business Process Management TasksMichael Wornow, Avanika Narayan, Ben Viggiano, Ishan S. Khare, Tathagat Verma, Tibor Thompson, Miguel Angel Fuentes Hernandez, Sudharsan Sundar, Chloe Trujillo, Krrish Chawla, Rongfei Lu, Justin Shen, Divya Nagaraj, Joshua Martinez, Vardhan Agrawal, Althea Hudson, Nigam H. Shah, Christopher ReSubjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Software Engineering (cs.SE)
Existing ML benchmarks lack the depth and diversity of annotations needed for evaluating models on business process management (BPM) tasks. BPM is the practice of documenting, measuring, improving, and automating enterprise workflows. However, research has focused almost exclusively on one task - full end-to-end automation using agents based on multimodal foundation models (FMs) like GPT-4. This focus on automation ignores the reality of how most BPM tools are applied today - simply documenting the relevant workflow takes 60% of the time of the typical process optimization project. To address this gap we present WONDERBREAD, the first benchmark for evaluating multimodal FMs on BPM tasks beyond automation. Our contributions are: (1) a dataset containing 2928 documented workflow demonstrations; (2) 6 novel BPM tasks sourced from real-world applications ranging from workflow documentation to knowledge transfer to process improvement; and (3) an automated evaluation harness. Our benchmark shows that while state-of-the-art FMs can automatically generate documentation (e.g. recalling 88% of the steps taken in a video demonstration of a workflow), they struggle to re-apply that knowledge towards finer-grained validation of workflow completion (F1 < 0.3). We hope WONDERBREAD encourages the development of more "human-centered" AI tooling for enterprise applications and furthers the exploration of multimodal FMs for the broader universe of BPM tasks. We publish our dataset and experiments here: this https URL
- [175] arXiv:2406.13265 [pdf, html, other]
-
Title: Molecule Graph Networks with Many-body Equivariant InteractionsSubjects: Machine Learning (cs.LG); Materials Science (cond-mat.mtrl-sci)
Message passing neural networks have demonstrated significant efficacy in predicting molecular interactions. Introducing equivariant vectorial representations augments expressivity by capturing geometric data symmetries, thereby improving model accuracy. However, two-body bond vectors in opposition may cancel each other out during message passing, leading to the loss of directional information on their shared node. In this study, we develop Equivariant N-body Interaction Networks (ENINet) that explicitly integrates equivariant many-body interactions to preserve directional information in the message passing scheme. Experiments indicate that integrating many-body equivariant representations enhances prediction accuracy across diverse scalar and tensorial quantum chemical properties. Ablation studies show an average performance improvement of 7.9% across 11 out of 12 properties in QM9, 27.9% in forces in MD17, and 11.3% in polarizabilities (CCSD) in QM7b.
- [176] arXiv:2406.13267 [pdf, html, other]
-
Title: The Kinetics Observer: A Tightly Coupled Estimator for Legged RobotsArnaud Demont (CNRS-AIST JRL, LISV), Mehdi Benallegue (CNRS-AIST JRL), Abdelaziz Benallegue (LISV, UVSQ), Pierre Gergondet (CNRS-AIST JRL), Antonin Dallard (LIRMM), Rafael Cisneros (CNRS-AIST JRL), Masaki Murooka (CNRS-AIST JRL), Fumio Kanehiro (CNRS-AIST JRL)Subjects: Robotics (cs.RO)
In this paper, we propose the "Kinetics Observer", a novel estimator addressing the challenge of state estimation for legged robots using proprioceptive sensors (encoders, IMU and force/torque sensors). Based on a Multiplicative Extended Kalman Filter, the Kinetics Observer allows the real-time simultaneous estimation of contact and perturbation forces, and of the robot's kinematics, which are accurate enough to perform proprioceptive odometry. Thanks to a visco-elastic model of the contacts linking their kinematics to the ones of the centroid of the robot, the Kinetics Observer ensures a tight coupling between the whole-body kinematics and dynamics of the robot. This coupling entails a redundancy of the measurements that enhances the robustness and the accuracy of the estimation. This estimator was tested on two humanoid robots performing long distance walking on even terrain and non-coplanar multi-contact locomotion.
- [177] arXiv:2406.13269 [pdf, other]
-
Title: Investigating Low-Cost LLM Annotation for~Spoken Dialogue Understanding DatasetsJournal-ref: 27th International Conference on Text, Speech and Dialogue, Sep 2024, Brno (R{\'e}p. Tch{\`e}que), Czech RepublicSubjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Signal Processing (eess.SP)
In spoken Task-Oriented Dialogue (TOD) systems, the choice of the semantic representation describing the users' requests is key to a smooth interaction. Indeed, the system uses this representation to reason over a database and its domain knowledge to choose its next action. The dialogue course thus depends on the information provided by this semantic representation. While textual datasets provide fine-grained semantic representations, spoken dialogue datasets fall behind. This paper provides insights into automatic enhancement of spoken dialogue datasets' semantic representations. Our contributions are three fold: (1) assess the relevance of Large Language Model fine-tuning, (2) evaluate the knowledge captured by the produced annotations and (3) highlight semi-automatic annotation implications.
- [178] arXiv:2406.13271 [pdf, html, other]
-
Title: Hierarchical IoU Tracking based on IntervalComments: 7 pages, 3 figuresSubjects: Computer Vision and Pattern Recognition (cs.CV)
Multi-Object Tracking (MOT) aims to detect and associate all targets of given classes across frames. Current dominant solutions, e.g. ByteTrack and StrongSORT++, follow the hybrid pipeline, which first accomplish most of the associations in an online manner, and then refine the results using offline tricks such as interpolation and global link. While this paradigm offers flexibility in application, the disjoint design between the two stages results in suboptimal performance. In this paper, we propose the Hierarchical IoU Tracking framework, dubbed HIT, which achieves unified hierarchical tracking by utilizing tracklet intervals as priors. To ensure the conciseness, only IoU is utilized for association, while discarding the heavy appearance models, tricky auxiliary cues, and learning-based association modules. We further identify three inconsistency issues regarding target size, camera movement and hierarchical cues, and design corresponding solutions to guarantee the reliability of associations. Though its simplicity, our method achieves promising performance on four datasets, i.e., MOT17, KITTI, DanceTrack and VisDrone, providing a strong baseline for future tracking method design. Moreover, we experiment on seven trackers and prove that HIT can be seamlessly integrated with other solutions, whether they are motion-based, appearance-based or learning-based. Our codes will be released at this https URL.
- [179] arXiv:2406.13272 [pdf, html, other]
-
Title: AniFaceDiff: High-Fidelity Face Reenactment via Facial Parametric Conditioned Diffusion ModelsKen Chen, Sachith Seneviratne, Wei Wang, Dongting Hu, Sanjay Saha, Md. Tarek Hasan, Sanka Rasnayaka, Tamasha Malepathirana, Mingming Gong, Saman HalgamugeSubjects: Computer Vision and Pattern Recognition (cs.CV)
Face reenactment refers to the process of transferring the pose and facial expressions from a reference (driving) video onto a static facial (source) image while maintaining the original identity of the source image. Previous research in this domain has made significant progress by training controllable deep generative models to generate faces based on specific identity, pose and expression conditions. However, the mechanisms used in these methods to control pose and expression often inadvertently introduce identity information from the driving video, while also causing a loss of expression-related details. This paper proposes a new method based on Stable Diffusion, called AniFaceDiff, incorporating a new conditioning module for high-fidelity face reenactment. First, we propose an enhanced 2D facial snapshot conditioning approach by facial shape alignment to prevent the inclusion of identity information from the driving video. Then, we introduce an expression adapter conditioning mechanism to address the potential loss of expression-related information. Our approach effectively preserves pose and expression fidelity from the driving video while retaining the identity and fine details of the source image. Through experiments on the VoxCeleb dataset, we demonstrate that our method achieves state-of-the-art results in face reenactment, showcasing superior image quality, identity preservation, and expression accuracy, especially for cross-identity scenarios. Considering the ethical concerns surrounding potential misuse, we analyze the implications of our method, evaluate current state-of-the-art deepfake detectors, and identify their shortcomings to guide future research.
- [180] arXiv:2406.13274 [pdf, html, other]
-
Title: In-Context Learning on a Budget: A Case Study in Named Entity RecognitionSubjects: Computation and Language (cs.CL)
Few shot in-context learning (ICL) typically assumes access to large annotated training sets. However, in many real world scenarios, such as domain adaptation, there is only a limited budget to annotate a small number of samples, with the goal of maximizing downstream performance. We study various methods for selecting samples to annotate within a predefined budget, specifically focusing on the named entity recognition (NER) task, which has real-world applications, is expensive to annotate, and is relatively less studied in ICL setups. Across different models and datasets, we find that a relatively small pool of annotated samples can achieve results comparable to using the entire training set. Moreover, we discover that random selection of samples for annotation yields surprisingly good performance. Finally, we observe that a diverse annotation pool is correlated with improved performance. We hope that future work adopts our realistic paradigm which takes annotation budget into account.
- [181] arXiv:2406.13275 [pdf, html, other]
-
Title: Enhancing Automated Audio Captioning via Large Language Models with Optimized Audio EncodingJizhong Liu, Gang Li, Junbo Zhang, Heinrich Dinkel, Yongqing Wang, Zhiyong Yan, Yujun Wang, Bin WangComments: Accepted by Interspeech 2024Subjects: Sound (cs.SD); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
Automated audio captioning (AAC) is an audio-to-text task to describe audio contents in natural language. Recently, the advancements in large language models (LLMs), with improvements in training approaches for audio encoders, have opened up possibilities for improving AAC. Thus, we explore enhancing AAC from three aspects: 1) a pre-trained audio encoder via consistent ensemble distillation (CED) is used to improve the effectivity of acoustic tokens, with a querying transformer (Q-Former) bridging the modality gap to LLM and compress acoustic tokens; 2) we investigate the advantages of using a Llama 2 with 7B parameters as the decoder; 3) another pre-trained LLM corrects text errors caused by insufficient training data and annotation ambiguities. Both the audio encoder and text decoder are optimized by -Base (LoRA). Experiments show that each of these enhancements is effective. Our method obtains a 33.0 SPIDEr-FL score, outperforming the winner of DCASE 2023 Task 6A.
- [182] arXiv:2406.13280 [pdf, other]
-
Title: Design Optimization of NOMA Aided Multi-STAR-RIS for Indoor Environments: A Convex Approximation Imitated Reinforcement Learning ApproachComments: 37 pages, 11 figures, IEEE Transactions on Communications submitted. arXiv admin note: text overlap with arXiv:2311.08708Subjects: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI)
Sixth-generation (6G) networks leverage simultaneously transmitting and reflecting reconfigurable intelligent surfaces (STAR-RISs) to overcome the limitations of traditional RISs. STAR-RISs offer 360-degree full-space coverage and optimized transmission and reflection for enhanced network performance and dynamic control of the indoor propagation environment. However, deploying STAR-RISs indoors presents challenges in interference mitigation, power consumption, and real-time configuration. In this work, a novel network architecture utilizing multiple access points (APs) and STAR-RISs is proposed for indoor communication. An optimization problem encompassing user assignment, access point beamforming, and STAR-RIS phase control for reflection and transmission is formulated. The inherent complexity of the formulated problem necessitates a decomposition approach for an efficient solution. This involves tackling different sub-problems with specialized techniques: a many-to-one matching algorithm is employed to assign users to appropriate access points, optimizing resource allocation. To facilitate efficient resource management, access points are grouped using a correlation-based K-means clustering algorithm. Multi-agent deep reinforcement learning (MADRL) is leveraged to optimize the control of the STAR-RIS. Within the proposed MADRL framework, a novel approach is introduced where each decision variable acts as an independent agent, enabling collaborative learning and decision-making. Additionally, the proposed MADRL approach incorporates convex approximation (CA). This technique utilizes suboptimal solutions from successive convex approximation (SCA) to accelerate policy learning for the agents, thereby leading to faster environment adaptation and convergence. Simulations demonstrate significant network utility improvements compared to baseline approaches.
- [183] arXiv:2406.13281 [pdf, html, other]
-
Title: ECAFormer: Low-light Image Enhancement using Cross AttentionSubjects: Computer Vision and Pattern Recognition (cs.CV)
Low-light image enhancement (LLIE) is vital for autonomous driving. Despite the importance, existing LLIE methods often prioritize robustness in overall brightness adjustment, which can come at the expense of detail preservation. To overcome this limitation,we propose the Hierarchical Mutual Enhancement via Cross-Attention transformer (ECAFormer), a novel network that utilizes Dual Multi-head Self Attention (DMSA) to enhance both visual and semantic features across scales, significantly preserving details during the process. The cross-attention mechanism in ECAFormer not only improves upon traditional enhancement techniques but also excels in maintaining a balance between global brightness adjustment and local detail retention. Our extensive experimental validation on renowned low-illumination datasets, including SID and LOL, and additional tests on dark road scenarios. or performance over existing methods in terms of illumination enhancement and noise reduction, while also optimizing computational complexity and parameter count, further boosting SSIM and PSNR metrics. Our project is available at this https URL.
- [184] arXiv:2406.13282 [pdf, html, other]
-
Title: Understanding the RoPE Extensions of Long-Context LLMs: An Attention PerspectiveSubjects: Computation and Language (cs.CL)
Enabling LLMs to handle lengthy context is currently a research hotspot. Most LLMs are built upon rotary position embedding (RoPE), a popular position encoding method. Therefore, a prominent path is to extrapolate the RoPE trained on comparably short texts to far longer texts. A heavy bunch of efforts have been dedicated to boosting the extrapolation via extending the formulations of the RoPE, however, few of them have attempted to showcase their inner workings comprehensively. In this paper, we are driven to offer a straightforward yet in-depth understanding of RoPE extensions from an attention perspective and on two benchmarking tasks. A broad array of experiments reveals several valuable findings: 1) Maintaining attention patterns to those at the pretrained length improves extrapolation; 2) Large attention uncertainty leads to retrieval errors; 3) Using longer continual pretraining lengths for RoPE extensions could reduce attention uncertainty and significantly enhance extrapolation.
- [185] arXiv:2406.13283 [pdf, html, other]
-
Title: Large-Scale Dataset Pruning in Adversarial Training through Data Importance ExtrapolationComments: 8 pages, 5 figures, 3 tables, to be published in ICML: DMLR workshopSubjects: Machine Learning (cs.LG)
Their vulnerability to small, imperceptible attacks limits the adoption of deep learning models to real-world systems. Adversarial training has proven to be one of the most promising strategies against these attacks, at the expense of a substantial increase in training time. With the ongoing trend of integrating large-scale synthetic data this is only expected to increase even further. Thus, the need for data-centric approaches that reduce the number of training samples while maintaining accuracy and robustness arises. While data pruning and active learning are prominent research topics in deep learning, they are as of now largely unexplored in the adversarial training literature. We address this gap and propose a new data pruning strategy based on extrapolating data importance scores from a small set of data to a larger set. In an empirical evaluation, we demonstrate that extrapolation-based pruning can efficiently reduce dataset size while maintaining robustness.
- [186] arXiv:2406.13294 [pdf, html, other]
-
Title: Enhancing Cross-Prompt Transferability in Vision-Language Models through Contextual Injection of Target TokensComments: 13 pagesSubjects: Multimedia (cs.MM); Machine Learning (cs.LG)
Vision-language models (VLMs) seamlessly integrate visual and textual data to perform tasks such as image classification, caption generation, and visual question answering. However, adversarial images often struggle to deceive all prompts effectively in the context of cross-prompt migration attacks, as the probability distribution of the tokens in these images tends to favor the semantics of the original image rather than the target tokens. To address this challenge, we propose a Contextual-Injection Attack (CIA) that employs gradient-based perturbation to inject target tokens into both visual and textual contexts, thereby improving the probability distribution of the target tokens. By shifting the contextual semantics towards the target tokens instead of the original image semantics, CIA enhances the cross-prompt transferability of adversarial images.Extensive experiments on the BLIP2, InstructBLIP, and LLaVA models show that CIA outperforms existing methods in cross-prompt transferability, demonstrating its potential for more effective adversarial strategies in VLMs.
- [187] arXiv:2406.13295 [pdf, other]
-
Title: Media Forensics and Deepfake Systematic SurveyNadeem Jabbar CH, Aqib Saghir, Ayaz Ahmad Meer, Salman Ahmad Sahi, Bilal Hassan, Siddiqui Muhammad YasirComments: 46 pages, 9 figures, 5 tablesSubjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
Deepfake is a generative deep learning algorithm that creates or changes facial features in a very realistic way making it hard to differentiate the real from the fake features It can be used to make movies look better as well as to spread false information by imitating famous people In this paper many different ways to make a Deepfake are explained analyzed and separated categorically Using Deepfake datasets models are trained and tested for reliability through experiments Deepfakes are a type of facial manipulation that allow people to change their entire faces identities attributes and expressions The trends in the available Deepfake datasets are also discussed with a focus on how they have changed Using Deep learning a general Deepfake detection model is made Moreover the problems in making and detecting Deepfakes are also mentioned As a result of this survey it is expected that the development of new Deepfake based imaging tools will speed up in the future This survey gives indepth review of methods for manipulating images of face and various techniques to spot altered face images Four types of facial manipulation are specifically discussed which are attribute manipulation expression swap entire face synthesis and identity swap Across every manipulation category we yield information on manipulation techniques significant benchmarks for technical evaluation of counterfeit detection techniques available public databases and a summary of the outcomes of all such analyses From all of the topics in the survey we focus on the most recent development of Deepfake showing its advances and obstacles in detecting fake images
- [188] arXiv:2406.13299 [pdf, other]
-
Title: Empirical Evaluation of Integrated Trust Mechanism to Improve Trust in E-commerce ServicesSubjects: Social and Information Networks (cs.SI); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Performance (cs.PF)
There are mostly two approaches to tackle trust management worldwide Strong and crisp and Soft and Social. We analyze the impact of integrated trust mechanism in three different e-commerce services. The trust aspect is a dormant element between potential users and being developed expert or internet systems. We support our integration by preside over an experiment in controlled laboratory environment. The model selected for the experiment is a composite of policy and reputation based trust mechanisms and widely acknowledged in e-commerce industry. The integration between policy and trust mechanism was accomplished through mapping process, weakness of one brought to a close with the strength of other. Furthermore, experiment has been supervised to validate the effectiveness of implementation by segregating both integrated and traditional trust mechanisms in learning system
- [189] arXiv:2406.13300 [pdf, other]
-
Title: LightGBM robust optimization algorithm based on topological data analysisSubjects: Machine Learning (cs.LG)
To enhance the robustness of the Light Gradient Boosting Machine (LightGBM) algorithm for image classification, a topological data analysis (TDA)-based robustness optimization algorithm for LightGBM, TDA-LightGBM, is proposed to address the interference of noise on image classification. Initially, the method partitions the feature engineering process into two streams: pixel feature stream and topological feature stream for feature extraction respectively. Subsequently, these pixel and topological features are amalgamated into a comprehensive feature vector, serving as the input for LightGBM in image classification tasks. This fusion of features not only encompasses traditional feature engineering methodologies but also harnesses topological structure information to more accurately encapsulate the intrinsic features of the image. The objective is to surmount challenges related to unstable feature extraction and diminished classification accuracy induced by data noise in conventional image processing. Experimental findings substantiate that TDA-LightGBM achieves a 3% accuracy improvement over LightGBM on the SOCOFing dataset across five classification tasks under noisy conditions. In noise-free scenarios, TDA-LightGBM exhibits a 0.5% accuracy enhancement over LightGBM on two classification tasks, achieving a remarkable accuracy of 99.8%. Furthermore, the method elevates the classification accuracy of the Ultrasound Breast Images for Breast Cancer dataset and the Masked CASIA WebFace dataset by 6% and 15%, respectively, surpassing LightGBM in the presence of noise. These empirical results underscore the efficacy of the TDA-LightGBM approach in fortifying the robustness of LightGBM by integrating topological features, thereby augmenting the performance of image classification tasks amidst data perturbations.
- [190] arXiv:2406.13301 [pdf, html, other]
-
Title: ARDuP: Active Region Video Diffusion for Universal PoliciesShuaiyi Huang, Mara Levy, Zhenyu Jiang, Anima Anandkumar, Yuke Zhu, Linxi Fan, De-An Huang, Abhinav ShrivastavaSubjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Sequential decision-making can be formulated as a text-conditioned video generation problem, where a video planner, guided by a text-defined goal, generates future frames visualizing planned actions, from which control actions are subsequently derived. In this work, we introduce Active Region Video Diffusion for Universal Policies (ARDuP), a novel framework for video-based policy learning that emphasizes the generation of active regions, i.e. potential interaction areas, enhancing the conditional policy's focus on interactive areas critical for task execution. This innovative framework integrates active region conditioning with latent diffusion models for video planning and employs latent representations for direct action decoding during inverse dynamic modeling. By utilizing motion cues in videos for automatic active region discovery, our method eliminates the need for manual annotations of active regions. We validate ARDuP's efficacy via extensive experiments on simulator CLIPort and the real-world dataset BridgeData v2, achieving notable improvements in success rates and generating convincingly realistic video plans.
- [191] arXiv:2406.13302 [pdf, html, other]
-
Title: Situational Instructions Database: Task Guidance in Dynamic EnvironmentsComments: 9 pages, 6 figuresSubjects: Computer Vision and Pattern Recognition (cs.CV)
The Situational Instructions Database (SID) addresses the need for enhanced situational awareness in artificial intelligence (AI) systems operating in dynamic environments. By integrating detailed scene graphs with dynamically generated, task-specific instructions, SID provides a novel dataset that allows AI systems to perform complex, real-world tasks with improved context sensitivity and operational accuracy. This dataset leverages advanced generative models to simulate a variety of realistic scenarios based on the 3D Semantic Scene Graphs (3DSSG) dataset, enriching it with scenario-specific information that details environmental interactions and tasks. SID facilitates the development of AI applications that can adapt to new and evolving conditions without extensive retraining, supporting research in autonomous technology and AI-driven decision-making processes. This dataset is instrumental in developing robust, context-aware AI agents capable of effectively navigating and responding to unpredictable settings. Available for research and development, SID serves as a critical resource for advancing the capabilities of intelligent systems in complex environments. Dataset available at \url{this https URL}.
- [192] arXiv:2406.13303 [pdf, other]
-
Title: Integration of Policy and Reputation based Trust Mechanisms in e-Commerce IndustrySubjects: Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Multimedia (cs.MM); Social and Information Networks (cs.SI)
The e-commerce systems are being tackled from commerce behavior and internet technologies. Therefore, trust aspect between buyer-seller transactions is a potential element which needs to be addressed in competitive e-commerce industry. The e-commerce industry is currently handling two different trust approaches. First approach consists on centralized mechanism where digital credentials/set of rules assembled, called Policy based trust mechanisms . Second approach consists on decentralized trust mechanisms where reputation, points assembled and shared, called Reputation based trust mechanisms. The difference between reputation and policy based trust mechanism will be analyzed and recommendations would be proposed to increase trust between buyer and seller in e-commerce industry. The integration of trust mechanism is proposed through mapping process, strength of one mechanism with the weakness of other. The proposed model for integrated mechanism will be presented and illustrated how the proposed model will be used in real world e-commerce industry.
- [193] arXiv:2406.13305 [pdf, html, other]
-
Title: Multimodal MRI-based Detection of Amyloid Status in Alzheimer's Disease ContinuumGiorgio Dolci (1,2,3), Charles A. Ellis (3), Federica Cruciani (2), Lorenza Brusini (2), Anees Abrol (3), Ilaria Boscolo Galazzo (2), Gloria Menegaz (2), Vince D. Calhoun (3) ((1) Department of Computer Science, University of Verona, Verona, Italy, (2) Department of Engineering for Innovation Medicine, University of Verona, Verona, Italy, (3) Tri-Institutional Center for Translational Research in Neuroimaging and Data Science (TReNDS), Georgia State University, Georgia Institute of Technology, Emory University, Atlanta, GA, USA)Comments: 15 pages, 3 figures, submitted to a journalSubjects: Artificial Intelligence (cs.AI)
Amyloid-$\beta$ (A$\beta$) plaques in conjunction with hyperphosphorylated tau proteins in the form of neurofibrillary tangles are the two neuropathological hallmarks of Alzheimer's disease (AD). In particular, the accumulation of A$\beta$ plaques, as evinced by the A/T/N (amyloid/tau/neurodegeneration) framework, marks the initial stage. Thus, the identification of individuals with A$\beta$ positivity could enable early diagnosis and potentially lead to more effective interventions. Deep learning methods relying mainly on amyloid PET images have been employed to this end. However, PET imaging has some disadvantages, including the need of radiotracers and expensive acquisitions. Hence, in this work, we propose a novel multimodal approach that integrates information from structural, functional, and diffusion MRI data to discriminate A$\beta$ status in the AD continuum. Our method achieved an accuracy of $0.762\pm0.04$. Furthermore, a \textit{post-hoc} explainability analysis (guided backpropagation) was performed to retrieve the brain regions that most influenced the model predictions. This analysis identified some key regions that were common across modalities, some of which were well-established AD-discriminative biomarkers and related to A$\beta$ deposition, such as the hippocampus, thalamus, precuneus, and cingulate gyrus. Hence, our study demonstrates the potential viability of MRI-based characterization of A$\beta$ status, paving the way for further research in this domain.
- [194] arXiv:2406.13308 [pdf, other]
-
Title: Deep Learning-Based 3D Instance and Semantic Segmentation: A ReviewSubjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
The process of segmenting point cloud data into several homogeneous areas with points in the same region having the same attributes is known as 3D segmentation. Segmentation is challenging with point cloud data due to substantial redundancy, fluctuating sample density and lack of apparent organization. The research area has a wide range of robotics applications, including intelligent vehicles, autonomous mapping and navigation. A number of researchers have introduced various methodologies and algorithms. Deep learning has been successfully used to a spectrum of 2D vision domains as a prevailing A.I. methods. However, due to the specific problems of processing point clouds with deep neural networks, deep learning on point clouds is still in its initial stages. This study examines many strategies that have been presented to 3D instance and semantic segmentation and gives a complete assessment of current developments in deep learning-based 3D segmentation. In these approaches benefits, draw backs, and design mechanisms are studied and addressed. This study evaluates the impact of various segmentation algorithms on competitiveness on various publicly accessible datasets, as well as the most often used pipelines, their advantages and limits, insightful findings and intriguing future research directions.
- [195] arXiv:2406.13316 [pdf, html, other]
-
Title: Reinforcing Pre-trained Models Using Counterfactual ImagesComments: 6 pages, 4 figuresSubjects: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
This paper proposes a novel framework to reinforce classification models using language-guided generated counterfactual images. Deep learning classification models are often trained using datasets that mirror real-world scenarios. In this training process, because learning is based solely on correlations with labels, there is a risk that models may learn spurious relationships, such as an overreliance on features not central to the subject, like background elements in images. However, due to the black-box nature of the decision-making process in deep learning models, identifying and addressing these vulnerabilities has been particularly challenging. We introduce a novel framework for reinforcing the classification models, which consists of a two-stage process. First, we identify model weaknesses by testing the model using the counterfactual image dataset, which is generated by perturbed image captions. Subsequently, we employ the counterfactual images as an augmented dataset to fine-tune and reinforce the classification model. Through extensive experiments on several classification models across various datasets, we revealed that fine-tuning with a small set of counterfactual images effectively strengthens the model.
- [196] arXiv:2406.13317 [pdf, html, other]
-
Title: M4Fog: A Global Multi-Regional, Multi-Modal, and Multi-Stage Dataset for Marine Fog Detection and Forecasting to Bridge Ocean and AtmosphereMengqiu Xu, Ming Wu, Kaixin Chen, Yixiang Huang, Mingrui Xu, Yujia Yang, Yiqing Feng, Yiying Guo, Bin Huang, Dongliang Chang, Zhenwei Shi, Chuang Zhang, Zhanyu Ma, Jun GuoSubjects: Computer Vision and Pattern Recognition (cs.CV)
Marine fog poses a significant hazard to global shipping, necessitating effective detection and forecasting to reduce economic losses. In recent years, several machine learning (ML) methods have demonstrated superior detection accuracy compared to traditional meteorological methods. However, most of these works are developed on proprietary datasets, and the few publicly accessible datasets are often limited to simplistic toy scenarios for research purposes. To advance the field, we have collected nearly a decade's worth of multi-modal data related to continuous marine fog stages from four series of geostationary meteorological satellites, along with meteorological observations and numerical analysis, covering 15 marine regions globally where maritime fog frequently occurs. Through pixel-level manual annotation by meteorological experts, we present the most comprehensive marine fog detection and forecasting dataset to date, named M4Fog, to bridge ocean and atmosphere. The dataset comprises 68,000 "super data cubes" along four dimensions: elements, latitude, longitude and time, with a temporal resolution of half an hour and a spatial resolution of 1 kilometer. Considering practical applications, we have defined and explored three meaningful tracks with multi-metric evaluation systems: static or dynamic marine fog detection, and spatio-temporal forecasting for cloud images. Extensive benchmarking and experiments demonstrate the rationality and effectiveness of the construction concept for proposed M4Fog. The data and codes are available to whole researchers through cloud platforms to develop ML-driven marine fog solutions and mitigate adverse impacts on human activities.
- [197] arXiv:2406.13322 [pdf, html, other]
-
Title: CLIP-Branches: Interactive Fine-Tuning for Text-Image RetrievalSubjects: Information Retrieval (cs.IR)
The advent of text-image models, most notably CLIP, has significantly transformed the landscape of information retrieval. These models enable the fusion of various modalities, such as text and images. One significant outcome of CLIP is its capability to allow users to search for images using text as a query, as well as vice versa. This is achieved via a joint embedding of images and text data that can, for instance, be used to search for similar items. Despite efficient query processing techniques such as approximate nearest neighbor search, the results may lack precision and completeness. We introduce CLIP-Branches, a novel text-image search engine built upon the CLIP architecture. Our approach enhances traditional text-image search engines by incorporating an interactive fine-tuning phase, which allows the user to further concretize the search query by iteratively defining positive and negative examples. Our framework involves training a classification model given the additional user feedback and essentially outputs all positively classified instances of the entire data catalog. By building upon recent techniques, this inference phase, however, is not implemented by scanning the entire data catalog, but by employing efficient index structures pre-built for the data. Our results show that the fine-tuned results can improve the initial search outputs in terms of relevance and accuracy while maintaining swift response times
- [198] arXiv:2406.13327 [pdf, html, other]
-
Title: Part-aware Unified Representation of Language and Skeleton for Zero-shot Action RecognitionSubjects: Computer Vision and Pattern Recognition (cs.CV)
While remarkable progress has been made on supervised skeleton-based action recognition, the challenge of zero-shot recognition remains relatively unexplored. In this paper, we argue that relying solely on aligning label-level semantics and global skeleton features is insufficient to effectively transfer locally consistent visual knowledge from seen to unseen classes. To address this limitation, we introduce Part-aware Unified Representation between Language and Skeleton (PURLS) to explore visual-semantic alignment at both local and global scales. PURLS introduces a new prompting module and a novel partitioning module to generate aligned textual and visual representations across different levels. The former leverages a pre-trained GPT-3 to infer refined descriptions of the global and local (body-part-based and temporal-interval-based) movements from the original action labels. The latter employs an adaptive sampling strategy to group visual features from all body joint movements that are semantically relevant to a given description. Our approach is evaluated on various skeleton/language backbones and three large-scale datasets, i.e., NTU-RGB+D 60, NTU-RGB+D 120, and a newly curated dataset Kinetics-skeleton 200. The results showcase the universality and superior performance of PURLS, surpassing prior skeleton-based solutions and standard baselines from other domains. The source codes can be accessed at this https URL.
- [199] arXiv:2406.13329 [pdf, html, other]
-
Title: On rough mereology and VC-dimension in treatment of decision prediction for open world decision systemsComments: This article is preliminarily submitted to ArXiv then the author will choose an appropriate journal probably Fundamenta InformaticaeSubjects: Machine Learning (cs.LG)
Given a raw knowledge in the form of a data table/a decision system, one is facing two possible venues. One, to treat the system as closed, i.e., its universe does not admit new objects, or, to the contrary, its universe is open on admittance of new objects. In particular, one may obtain new objects whose sets of values of features are new to the system. In this case the problem is to assign a decision value to any such new object. This problem is somehow resolved in the rough set theory, e.g., on the basis of similarity of the value set of a new object to value sets of objects already assigned a decision value. It is crucial for online learning when each new object must have a predicted decision value.\ There is a vast literature on various methods for decision prediction for new yet unseen object.
The approach we propose is founded in the theory of rough mereology and it requires a theory of sets/concepts, and, we root our theory in classical set theory of Syllogistic within which we recall the theory of parts known as Mereology. Then, we recall our theory of Rough Mereology along with the theory of weight assignment to the Tarski algebra of Mereology.\ This allows us to introduce the notion of a part to a degree. Once we have defined basics of Mereology and rough Mereology, we recall our theory of weight assignment to elements of the Boolean algebra within Mereology and this allows us to define the relation of parts to the degree and we apply this notion in a procedure to select a decision for new yet unseen objects.\ In selecting a plausible candidate which would pass its decision value to the new object, we employ the notion of Vapnik - Chervonenkis dimension in order to select at the first stage the candidate with the largest VC-dimension of the family of its $\varepsilon$-components for some choice of $\varepsilon$. - [200] arXiv:2406.13331 [pdf, html, other]
-
Title: Improving Zero-shot LLM Re-Ranker with Risk MinimizationComments: Under reviewSubjects: Computation and Language (cs.CL)
In the Retrieval-Augmented Generation (RAG) system, advanced Large Language Models (LLMs) have emerged as effective Query Likelihood Models (QLMs) in an unsupervised way, which re-rank documents based on the probability of generating the query given the content of a document. However, directly prompting LLMs to approximate QLMs inherently is biased, where the estimated distribution might diverge from the actual document-specific distribution. In this study, we introduce a novel framework, $\mathrm{UR^3}$, which leverages Bayesian decision theory to both quantify and mitigate this estimation bias. Specifically, $\mathrm{UR^3}$ reformulates the problem as maximizing the probability of document generation, thereby harmonizing the optimization of query and document generation probabilities under a unified risk minimization objective. Our empirical results indicate that $\mathrm{UR^3}$ significantly enhances re-ranking, particularly in improving the Top-1 accuracy. It benefits the QA tasks by achieving higher accuracy with fewer input documents.
- [201] arXiv:2406.13332 [pdf, html, other]
-
Title: How effective is Multi-source pivoting for Translation of Low Resource Indian Languages?Subjects: Computation and Language (cs.CL)
Machine Translation (MT) between linguistically dissimilar languages is challenging, especially due to the scarcity of parallel corpora. Prior works suggest that pivoting through a high-resource language can help translation into a related low-resource language. However, existing works tend to discard the source sentence when pivoting. Taking the case of English to Indian language MT, this paper explores the 'multi-source translation' approach with pivoting, using both source and pivot sentences to improve translation. We conducted extensive experiments with various multi-source techniques for translating English to Konkani, Manipuri, Sanskrit, and Bodo, using Hindi, Marathi, and Bengali as pivot languages. We find that multi-source pivoting yields marginal improvements over the state-of-the-art, contrary to previous claims, but these improvements can be enhanced with synthetic target language data. We believe multi-source pivoting is a promising direction for Low-resource translation.
- [202] arXiv:2406.13335 [pdf, html, other]
-
Title: AI-Empowered Multiple Access for 6G: A Survey of Spectrum Sensing, Protocol Designs, and OptimizationsSubjects: Networking and Internet Architecture (cs.NI); Signal Processing (eess.SP)
With the rapidly increasing number of bandwidth-intensive terminals capable of intelligent computing and communication, such as smart devices equipped with shallow neural network models, the complexity of multiple access for these intelligent terminals is increasing due to the dynamic network environment and ubiquitous connectivity in 6G systems. Traditional multiple access (MA) design and optimization methods are gradually losing ground to artificial intelligence (AI) techniques that have proven their superiority in handling complexity. AI-empowered MA and its optimization strategies aimed at achieving high Quality-of-Service (QoS) are attracting more attention, especially in the area of latency-sensitive applications in 6G systems. In this work, we aim to: 1) present the development and comparative evaluation of AI-enabled MA; 2) provide a timely survey focusing on spectrum sensing, protocol design, and optimization for AI-empowered MA; and 3) explore the potential use cases of AI-empowered MA in the typical application scenarios within 6G systems. Specifically, we first present a unified framework of AI-empowered MA for 6G systems by incorporating various promising machine learning techniques in spectrum sensing, resource allocation, MA protocol design, and optimization. We then introduce AI-empowered MA spectrum sensing related to spectrum sharing and spectrum interference management. Next, we discuss the AI-empowered MA protocol designs and implementation methods by reviewing and comparing the state-of-the-art, and we further explore the optimization algorithms related to dynamic resource management, parameter adjustment, and access scheme switching. Finally, we discuss the current challenges, point out open issues, and outline potential future research directions in this field.
- [203] arXiv:2406.13340 [pdf, html, other]
-
Title: SD-Eval: A Benchmark Dataset for Spoken Dialogue Understanding Beyond WordsJunyi Ao, Yuancheng Wang, Xiaohai Tian, Dekun Chen, Jun Zhang, Lu Lu, Yuxuan Wang, Haizhou Li, Zhizheng WuSubjects: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Speech encompasses a wealth of information, including but not limited to content, paralinguistic, and environmental information. This comprehensive nature of speech significantly impacts communication and is crucial for human-computer interaction. Chat-Oriented Large Language Models (LLMs), known for their general-purpose assistance capabilities, have evolved to handle multi-modal inputs, including speech. Although these models can be adept at recognizing and analyzing speech, they often fall short of generating appropriate responses. We argue that this is due to the lack of principles on task definition and model development, which requires open-source datasets and metrics suitable for model evaluation. To bridge the gap, we present SD-Eval, a benchmark dataset aimed at multidimensional evaluation of spoken dialogue understanding and generation. SD-Eval focuses on paralinguistic and environmental information and includes 7,303 utterances, amounting to 8.76 hours of speech data. The data is aggregated from eight public datasets, representing four perspectives: emotion, accent, age, and background sound. To assess the SD-Eval benchmark dataset, we implement three different models and construct a training set following a similar process as SD-Eval. The training set contains 1,052.72 hours of speech data and 724.4k utterances. We also conduct a comprehensive evaluation using objective evaluation methods (e.g. BLEU and ROUGE), subjective evaluations and LLM-based metrics for the generated responses. Models conditioned with paralinguistic and environmental information outperform their counterparts in both objective and subjective measures. Moreover, experiments demonstrate LLM-based metrics show a higher correlation with human evaluation compared to traditional metrics. We open-source SD-Eval at this https URL.
- [204] arXiv:2406.13342 [pdf, html, other]
-
Title: ZeroDL: Zero-shot Distribution Learning for Text Clustering via Large Language ModelsComments: ARR SubmittedSubjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
The recent advancements in large language models (LLMs) have brought significant progress in solving NLP tasks. Notably, in-context learning (ICL) is the key enabling mechanism for LLMs to understand specific tasks and grasping nuances. In this paper, we propose a simple yet effective method to contextualize a task toward a specific LLM, by (1) observing how a given LLM describes (all or a part of) target datasets, i.e., open-ended zero-shot inference, and (2) aggregating the open-ended inference results by the LLM, and (3) finally incorporate the aggregated meta-information for the actual task. We show the effectiveness of this approach in text clustering tasks, and also highlight the importance of the contextualization through examples of the above procedure.
- [205] arXiv:2406.13344 [pdf, html, other]
-
Title: WaterMono: Teacher-Guided Anomaly Masking and Enhancement Boosting for Robust Underwater Self-Supervised Monocular Depth EstimationSubjects: Computer Vision and Pattern Recognition (cs.CV)
Depth information serves as a crucial prerequisite for various visual tasks, whether on land or underwater. Recently, self-supervised methods have achieved remarkable performance on several terrestrial benchmarks despite the absence of depth annotations. However, in more challenging underwater scenarios, they encounter numerous brand-new obstacles such as the influence of marine life and degradation of underwater images, which break the assumption of a static scene and bring low-quality images, respectively. Besides, the camera angles of underwater images are more diverse. Fortunately, we have discovered that knowledge distillation presents a promising approach for tackling these challenges. In this paper, we propose WaterMono, a novel framework for depth estimation coupled with image enhancement. It incorporates the following key measures: (1) We present a Teacher-Guided Anomaly Mask to identify dynamic regions within the images; (2) We employ depth information combined with the Underwater Image Formation Model to generate enhanced images, which in turn contribute to the depth estimation task; and (3) We utilize a rotated distillation strategy to enhance the model's rotational robustness. Comprehensive experiments demonstrate the effectiveness of our proposed method for both depth estimation and image enhancement. The source code and pre-trained models are available on the project home page: this https URL.
- [206] arXiv:2406.13345 [pdf, html, other]
-
Title: Low Latency Visual Inertial Odometry with On-Sensor Accelerated Optical Flow for Resource-Constrained UAVsComments: This article has been accepted for publication in the IEEE Sensors Journal (JSEN)Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Visual Inertial Odometry (VIO) is the task of estimating the movement trajectory of an agent from an onboard camera stream fused with additional Inertial Measurement Unit (IMU) measurements. A crucial subtask within VIO is the tracking of features, which can be achieved through Optical Flow (OF). As the calculation of OF is a resource-demanding task in terms of computational load and memory footprint, which needs to be executed at low latency, especially in robotic applications, OF estimation is today performed on powerful CPUs or GPUs. This restricts its use in a broad spectrum of applications where the deployment of such powerful, power-hungry processors is unfeasible due to constraints related to cost, size, and power consumption. On-sensor hardware acceleration is a promising approach to enable low latency VIO even on resource-constrained devices such as nano drones. This paper assesses the speed-up in a VIO sensor system exploiting a compact OF sensor consisting of a global shutter camera and an Application Specific Integrated Circuit (ASIC). By replacing the feature tracking logic of the VINS-Mono pipeline with data from this OF camera, we demonstrate a 49.4% reduction in latency and a 53.7% reduction of compute load of the VIO pipeline over the original VINS-Mono implementation, allowing VINS-Mono operation up to 50 FPS instead of 20 FPS on the quad-core ARM Cortex-A72 processor of a Raspberry Pi Compute Module 4.
- [207] arXiv:2406.13348 [pdf, html, other]
-
Title: Textual Unlearning Gives a False Sense of UnlearningSubjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Language models (LMs) are susceptible to "memorizing" training data, including a large amount of private or copyright-protected content. To safeguard the right to be forgotten (RTBF), machine unlearning has emerged as a promising method for LMs to efficiently "forget" sensitive training content and mitigate knowledge leakage risks. However, despite its good intentions, could the unlearning mechanism be counterproductive? In this paper, we propose the Textual Unlearning Leakage Attack (TULA), where an adversary can infer information about the unlearned data only by accessing the models before and after unlearning. Furthermore, we present variants of TULA in both black-box and white-box scenarios. Through various experimental results, we critically demonstrate that machine unlearning amplifies the risk of knowledge leakage from LMs. Specifically, TULA can increase an adversary's ability to infer membership information about the unlearned data by more than 20% in black-box scenario. Moreover, TULA can even reconstruct the unlearned data directly with more than 60% accuracy with white-box access. Our work is the first to reveal that machine unlearning in LMs can inversely create greater knowledge risks and inspire the development of more secure unlearning mechanisms.
- [208] arXiv:2406.13351 [pdf, html, other]
-
Title: A Resource-Adaptive Approach for Federated Learning under Resource-Constrained EnvironmentsSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
The paper studies a fundamental federated learning (FL) problem involving multiple clients with heterogeneous constrained resources. Compared with the numerous training parameters, the computing and communication resources of clients are insufficient for fast local training and real-time knowledge sharing. Besides, training on clients with heterogeneous resources may result in the straggler problem. To address these issues, we propose Fed-RAA: a Resource-Adaptive Asynchronous Federated learning algorithm. Different from vanilla FL methods, where all parameters are trained by each participating client regardless of resource diversity, Fed-RAA adaptively allocates fragments of the global model to clients based on their computing and communication capabilities. Each client then individually trains its assigned model fragment and asynchronously uploads the updated result. Theoretical analysis confirms the convergence of our approach. Additionally, we design an online greedy-based algorithm for fragment allocation in Fed-RAA, achieving fairness comparable to an offline strategy. We present numerical results on MNIST, CIFAR-10, and CIFAR-100, along with necessary comparisons and ablation studies, demonstrating the advantages of our work. To the best of our knowledge, this paper represents the first resource-adaptive asynchronous method for fragment-based FL with guaranteed theoretical convergence.
- [209] arXiv:2406.13352 [pdf, html, other]
-
Title: AgentDojo: A Dynamic Environment to Evaluate Attacks and Defenses for LLM AgentsSubjects: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
AI agents aim to solve complex tasks by combining text-based reasoning with external tool calls. Unfortunately, AI agents are vulnerable to prompt injection attacks where data returned by external tools hijacks the agent to execute malicious tasks. To measure the adversarial robustness of AI agents, we introduce AgentDojo, an evaluation framework for agents that execute tools over untrusted data. To capture the evolving nature of attacks and defenses, AgentDojo is not a static test suite, but rather an extensible environment for designing and evaluating new agent tasks, defenses, and adaptive attacks. We populate the environment with 97 realistic tasks (e.g., managing an email client, navigating an e-banking website, or making travel bookings), 629 security test cases, and various attack and defense paradigms from the literature. We find that AgentDojo poses a challenge for both attacks and defenses: state-of-the-art LLMs fail at many tasks (even in the absence of attacks), and existing prompt injection attacks break some security properties but not all. We hope that AgentDojo can foster research on new design principles for AI agents that solve common tasks in a reliable and robust manner. We release the code for AgentDojo at this https URL.
- [210] arXiv:2406.13355 [pdf, html, other]
-
Title: Linear codes in the folded Hamming distance and the quasi MDS propertySubjects: Information Theory (cs.IT)
In this work, we study linear codes with the folded Hamming distance, or equivalently, codes with the classical Hamming distance that are linear over a subfield. This includes additive codes. We study MDS codes in this setting and define quasi MDS (QMDS) codes and dually QMDS codes, which attain a more relaxed variant of the classical Singleton bound. We provide several general results concerning these codes, including restriction, shortening, weight distributions, existence, density, geometric description and bounds on their lengths relative to their field sizes. We provide explicit examples and a binary construction with optimal lengths relative to their field sizes, which beats any MDS code.
- [211] arXiv:2406.13356 [pdf, html, other]
-
Title: Jogging the Memory of Unlearned Model Through Targeted Relearning AttackComments: 17 pages, 8 figures, 12 tablesSubjects: Machine Learning (cs.LG)
Machine unlearning is a promising approach to mitigate undesirable memorization of training data in ML models. However, in this work we show that existing approaches for unlearning in LLMs are surprisingly susceptible to a simple set of targeted relearning attacks. With access to only a small and potentially loosely related set of data, we find that we can 'jog' the memory of unlearned models to reverse the effects of unlearning. We formalize this unlearning-relearning pipeline, explore the attack across three popular unlearning benchmarks, and discuss future directions and guidelines that result from our study.
- [212] arXiv:2406.13357 [pdf, html, other]
-
Title: Transferable speech-to-text large language model alignment moduleComments: Accepted by InterSpeech 2024; 5 pages, 2 figuresSubjects: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
By leveraging the power of Large Language Models(LLMs) and speech foundation models, state of the art speech-text bimodal works can achieve challenging tasks like spoken translation(ST) and question answering(SQA) altogether with much simpler architectures. In this paper, we utilize the capability of Whisper encoder and pre-trained Yi-6B. Empirical results reveal that modal alignment can be achieved with one layer module and hundred hours of speech-text multitask corpus. We further swap the Yi-6B with human preferences aligned version of Yi-6B-Chat during inference, and discover that the alignment capability is applicable as well. In addition, the alignment subspace revealed by singular value decomposition(SVD) also implies linear alignment subspace is sparse, which leaves the possibility to concatenate other features like voice-print or video to expand modality.
- [213] arXiv:2406.13358 [pdf, html, other]
-
Title: Multi-scale Restoration of Missing Data in Optical Time-series Images with Masked Spatial-Temporal Attention NetworkSubjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Due to factors such as thick cloud cover and sensor limitations, remote sensing images often suffer from significant missing data, resulting in incomplete time-series information. Existing methods for imputing missing values in remote sensing images do not fully exploit spatio-temporal auxiliary information, leading to limited accuracy in restoration. Therefore, this paper proposes a novel deep learning-based approach called MS2TAN (Multi-scale Masked Spatial-Temporal Attention Network), for reconstructing time-series remote sensing images. Firstly, we introduce an efficient spatio-temporal feature extractor based on Masked Spatial-Temporal Attention (MSTA), to obtain high-quality representations of the spatio-temporal neighborhood features in the missing regions. Secondly, a Multi-scale Restoration Network consisting of the MSTA-based Feature Extractors, is employed to progressively refine the missing values by exploring spatio-temporal neighborhood features at different scales. Thirdly, we propose a ``Pixel-Structure-Perception'' Multi-Objective Joint Optimization method to enhance the visual effects of the reconstruction results from multiple perspectives and preserve more texture structures. Furthermore, the proposed method reconstructs missing values in all input temporal phases in parallel (i.e., Multi-In Multi-Out), achieving higher processing efficiency. Finally, experimental evaluations on two typical missing data restoration tasks across multiple research areas demonstrate that the proposed method outperforms state-of-the-art methods with an improvement of 0.40dB/1.17dB in mean peak signal-to-noise ratio (mPSNR) and 3.77/9.41 thousandths in mean structural similarity (mSSIM), while exhibiting stronger texture and structural consistency.
- [214] arXiv:2406.13359 [pdf, html, other]
-
Title: Search-based DNN Testing and Retraining with GAN-enhanced SimulationsComments: 14 pages, 3 figures, 7 tablesSubjects: Software Engineering (cs.SE)
In safety-critical systems (e.g., autonomous vehicles and robots), Deep Neural Networks (DNNs) are becoming a key component for computer vision tasks, particularly semantic segmentation. Further, since the DNN behavior cannot be assessed through code inspection and analysis, test automation has become an essential activity to gain confidence in the reliability of DNNs. Unfortunately, state-of-the-art automated testing solutions largely rely on simulators, whose fidelity is always imperfect, thus affecting the validity of test results. To address such limitations, we propose to combine meta-heuristic search, used to explore the input space using simulators, with Generative Adversarial Networks (GANs), to transform the data generated by simulators into realistic input images. Such images can be used both to assess the DNN performance and to retrain the DNN more effectively. We applied our approach to a state-of-the-art DNN performing semantic segmentation and demonstrated that it outperforms a state-of-the-art GAN-based testing solution and several baselines. Specifically, it leads to the largest number of diverse images leading to the worst DNN performance. Further, the images generated with our approach, lead to the highest improvement in DNN performance when used for retraining. In conclusion, we suggest to always integrate GAN components when performing search-driven, simulator-based testing.
- [215] arXiv:2406.13360 [pdf, other]
-
Title: Birkhoff style proof systems for hybrid-dynamic quantum logicComments: arXiv admin note: text overlap with arXiv:2406.02085Subjects: Logic in Computer Science (cs.LO)
We explore a simple approach to quantum logic based on hybrid and dynamic modal logic, where the set of states is given by some Hilbert space. In this setting, a notion of quantum clause is proposed in a similar way the notion of Horn clause is advanced in first-order logic, that is, to give logical properties for use in logic programming and formal specification. We propose proof rules for reasoning about quantum clauses and we investigate soundness and compactness properties that correspond to this proof calculus. Then we prove a Birkhoff completeness result for the fragment of hybrid-dynamic quantum logic determined by quantum clauses.
- [216] arXiv:2406.13361 [pdf, html, other]
-
Title: Improving Zero-Shot Cross-Lingual Transfer via Progressive Code-SwitchingComments: 9 pages, 5 figures, 6 tables. Accepted by International Joint Conference on Artificial Intelligence (IJCAI 2024)Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Code-switching is a data augmentation scheme mixing words from multiple languages into source lingual text. It has achieved considerable generalization performance of cross-lingual transfer tasks by aligning cross-lingual contextual word representations. However, uncontrolled and over-replaced code-switching would augment dirty samples to model training. In other words, the excessive code-switching text samples will negatively hurt the models' cross-lingual transferability. To this end, we propose a Progressive Code-Switching (PCS) method to gradually generate moderately difficult code-switching examples for the model to discriminate from easy to hard. The idea is to incorporate progressively the preceding learned multilingual knowledge using easier code-switching data to guide model optimization on succeeding harder code-switching data. Specifically, we first design a difficulty measurer to measure the impact of replacing each word in a sentence based on the word relevance score. Then a code-switcher generates the code-switching data of increasing difficulty via a controllable temperature variable. In addition, a training scheduler decides when to sample harder code-switching data for model training. Experiments show our model achieves state-of-the-art results on three different zero-shot cross-lingual transfer tasks across ten languages.
- [217] arXiv:2406.13362 [pdf, html, other]
-
Title: VisualRWKV: Exploring Recurrent Neural Networks for Visual Language ModelsComments: 18 pages,14 tables,6 figuresSubjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
Visual Language Models (VLMs) have rapidly progressed with the recent success of large language models. However, there have been few attempts to incorporate efficient linear Recurrent Neural Networks (RNNs) architectures into VLMs. In this study, we introduce VisualRWKV, the first application of a linear RNN model to multimodal learning tasks, leveraging the pre-trained RWKV language model. We propose a data-dependent recurrence and sandwich prompts to enhance our modeling capabilities, along with a 2D image scanning mechanism to enrich the processing of visual sequences. Extensive experiments demonstrate that VisualRWKV achieves competitive performance compared to Transformer-based models like LLaVA-1.5 on various benchmarks. To facilitate further research and analysis, we have made the checkpoints and the associated code publicly accessible at the following GitHub repository: \href{this https URL}{this https URL}.
- [218] arXiv:2406.13363 [pdf, html, other]
-
Title: Evaluating Structural Generalization in Neural Machine TranslationComments: To appear at ACL 2024 findingsSubjects: Computation and Language (cs.CL)
Compositional generalization refers to the ability to generalize to novel combinations of previously observed words and syntactic structures. Since it is regarded as a desired property of neural models, recent work has assessed compositional generalization in machine translation as well as semantic parsing. However, previous evaluations with machine translation have focused mostly on lexical generalization (i.e., generalization to unseen combinations of known words). Thus, it remains unclear to what extent models can translate sentences that require structural generalization (i.e., generalization to different sorts of syntactic structures). To address this question, we construct SGET, a machine translation dataset covering various types of compositional generalization with control of words and sentence structures. We evaluate neural machine translation models on SGET and show that they struggle more in structural generalization than in lexical generalization. We also find different performance trends in semantic parsing and machine translation, which indicates the importance of evaluations across various tasks.
- [219] arXiv:2406.13365 [pdf, html, other]
-
Title: PPT-GNN: A Practical Pre-Trained Spatio-Temporal Graph Neural Network for Network SecurityComments: Paper currently under review. Code will be made public upon acceptance. 8 pages long, 4 figuresSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Recent works have demonstrated the potential of Graph Neural Networks (GNN) for network intrusion detection. Despite their advantages, a significant gap persists between real-world scenarios, where detection speed is critical, and existing proposals, which operate on large graphs representing several hours of traffic. This gap results in unrealistic operational conditions and impractical detection delays. Moreover, existing models do not generalize well across different networks, hampering their deployment in production environments. To address these issues, we introduce PPTGNN, a practical spatio-temporal GNN for intrusion detection. PPTGNN enables near real-time predictions, while better capturing the spatio-temporal dynamics of network attacks. PPTGNN employs self-supervised pre-training for improved performance and reduced dependency on labeled data. We evaluate PPTGNN on three public datasets and show that it significantly outperforms state-of-the-art models, such as E-ResGAT and E-GraphSAGE, with an average accuracy improvement of 10.38%. Finally, we show that a pre-trained PPTGNN can easily be fine-tuned to unseen networks with minimal labeled examples. This highlights the potential of PPTGNN as a general, large-scale pre-trained model that can effectively operate in diverse network environments.
- [220] arXiv:2406.13366 [pdf, html, other]
-
Title: Learning the Approach During the Short-loading Cycle Using Reinforcement LearningSubjects: Robotics (cs.RO)
The short-loading cycle is a repetitive task performed in high quantities, making it a great alternative for automation. In the short-loading cycle, an expert operator navigates towards a pile, fills the bucket with material, navigates to a dump truck, and dumps the material into the tipping body. The operator has to balance the productivity goal while minimising the fuel usage, to maximise the overall efficiency of the cycle. In addition, difficult interactions, such as the tyre-to-surface interaction further complicate the cycle. These types of hard-to-model interactions that can be difficult to address with rule-based systems, together with the efficiency requirements, motivate us to examine the potential of data-driven approaches. In this paper, the possibility of teaching an agent through reinforcement learning to approach a dump truck's tipping body and get in position to dump material in the tipping body is examined. The agent is trained in a 3D simulated environment to perform a simplified navigation task. The trained agent is directly transferred to a real vehicle, to perform the same task, with no additional training. The results indicate that the agent can successfully learn to navigate towards the dump truck with a limited amount of control signals in simulation and when transferred to a real vehicle, exhibits the correct behaviour.
- [221] arXiv:2406.13367 [pdf, other]
-
Title: The disruption index in the multiverse: The calculation of scores comes with numerous (hidden) degrees of freedomComments: 8 pages, 1 tableSubjects: Digital Libraries (cs.DL)
Following Funk and Owen-Smith (2017), Wu et al. (2019) proposed the disruption index (DI1) as a bibliometric indicator that measures disruptive and consolidating research. When we summarized the literature on the disruption index for our recently published review article (Leibel & Bornmann, 2024), we noticed that the calculation of disruption scores comes with numerous (hidden) degrees of freedom. In this Letter to the Editor, we explain why this analytical flexibility endangers the credibility of bibliometric research based on the DI1 (and its variants) and advertise the application of multiverse-style methods to increase the transparency of the research.
- [222] arXiv:2406.13369 [pdf, other]
-
Title: Effective Edge-wise Representation Learning in Edge-Attributed Bipartite GraphsComments: 11 pages. Full version of the research paper accepted to KDD 2024Subjects: Machine Learning (cs.LG); Social and Information Networks (cs.SI)
Graph representation learning (GRL) is to encode graph elements into informative vector representations, which can be used in downstream tasks for analyzing graph-structured data and has seen extensive applications in various domains. However, the majority of extant studies on GRL are geared towards generating node representations, which cannot be readily employed to perform edge-based analytics tasks in edge-attributed bipartite graphs (EABGs) that pervade the real world, e.g., spam review detection in customer-product reviews and identifying fraudulent transactions in user-merchant networks. Compared to node-wise GRL, learning edge representations (ERL) on such graphs is challenging due to the need to incorporate the structure and attribute semantics from the perspective of edges while considering the separate influence of two heterogeneous node sets U and V in bipartite graphs. To our knowledge, despite its importance, limited research has been devoted to this frontier, and existing workarounds all suffer from sub-par results.
Motivated by this, this paper designs EAGLE, an effective ERL method for EABGs. Building on an in-depth and rigorous theoretical analysis, we propose the factorized feature propagation (FFP) scheme for edge representations with adequate incorporation of long-range dependencies of edges/features without incurring tremendous computation overheads. We further ameliorate FFP as a dual-view FFP by taking into account the influences from nodes in U and V severally in ERL. Extensive experiments on 5 real datasets showcase the effectiveness of the proposed EAGLE models in semi-supervised edge classification tasks. In particular, EAGLE can attain a considerable gain of at most 38.11% in AP and 1.86% in AUC when compared to the best baselines. - [223] arXiv:2406.13371 [pdf, other]
-
Title: Identifiable Causal Representation Learning: Unsupervised, Multi-View, and Multi-EnvironmentComments: PhD Thesis; 190 pages, 33 figures, 6 tablesJournal-ref: University of Cambridge, 2024Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Causal models provide rich descriptions of complex systems as sets of mechanisms by which each variable is influenced by its direct causes. They support reasoning about manipulating parts of the system and thus hold promise for addressing some of the open challenges of artificial intelligence (AI), such as planning, transferring knowledge in changing environments, or robustness to distribution shifts. However, a key obstacle to more widespread use of causal models in AI is the requirement that the relevant variables be specified a priori, which is typically not the case for the high-dimensional, unstructured data processed by modern AI systems. At the same time, machine learning (ML) has proven quite successful at automatically extracting useful and compact representations of such complex data. Causal representation learning (CRL) aims to combine the core strengths of ML and causality by learning representations in the form of latent variables endowed with causal model semantics.
In this thesis, we study and present new results for different CRL settings. A central theme is the question of identifiability: Given infinite data, when are representations satisfying the same learning objective guaranteed to be equivalent? This is an important prerequisite for CRL, as it formally characterises if and when a learning task is, at least in principle, feasible. Since learning causal models, even without a representation learning component, is notoriously difficult, we require additional assumptions on the model class or rich data beyond the classical i.i.d. setting. By partially characterising identifiability for different settings, this thesis investigates what is possible for CRL without direct supervision, and thus contributes to its theoretical foundations. Ideally, the developed insights can help inform data collection practices or inspire the design of new practical estimation methods. - [224] arXiv:2406.13372 [pdf, html, other]
-
Title: Thread: A Logic-Based Data Organization Paradigm for How-To Question Answering with Retrieval Augmented GenerationKaikai An, Fangkai Yang, Liqun Li, Junting Lu, Sitao Cheng, Lu Wang, Pu Zhao, Lele Cao, Qingwei Lin, Saravan Rajmohan, Dongmei Zhang, Qi ZhangComments: 21 pages, 4 figuresSubjects: Artificial Intelligence (cs.AI)
Current question answering systems leveraging retrieval augmented generation perform well in answering factoid questions but face challenges with non-factoid questions, particularly how-to queries requiring detailed step-by-step instructions and explanations. In this paper, we introduce Thread, a novel data organization paradigm that transforms documents into logic units based on their inter-connectivity. Extensive experiments across open-domain and industrial scenarios demonstrate that Thread outperforms existing data organization paradigms in RAG-based QA systems, significantly improving the handling of how-to questions.
- [225] arXiv:2406.13374 [pdf, html, other]
-
Title: State Anti-windup: A New Methodology for Tackling State Constraints at Both Synthesis and Implementation LevelsComments: 15 pages 21 FiguresSubjects: Systems and Control (eess.SY)
The anti-windup compensation typically addresses strict control limitations in control systems. However, there is a clear need for an equivalent solution for the states/outputs of the system. This paper introduces a novel methodology for the state anti-windup compensator. Unlike state-constrained control methods, which often focus on incorporating soft constraints into the design or fail to react adequately to constraint violations in practical settings, the proposed methodology treats state constraints as implement-oriented soft-hard constraints. This is achieved by integrating a saturation block within the structure of the safety compensator, referred to as the state anti-windup (SANTW) compensator. Similar to input anti-windup schemes, the SANTW design is separated from the nominal controller design. The problem is formulated as a disturbance rejection one to directly minimize the saturation. The paper develops two Hinf optimization frameworks using frequency-domain solutions and linear matrix inequalities. It then addresses constraints on both inputs and states, resulting in a unified Input-State Anti-windup (IS-ANTW) compensator synthesized using non-smooth Hinf optimization. This method also offers the flexibility of having a fixed-order compensator, crucial in many practical applications. Additionally, the study evaluates the proposed compensator's performance in managing current fluctuations from renewable energy sources during grid faults, demonstrating its effectiveness through detailed Electromagnetic Transient (EMT) simulations of grid-connected DC-AC converters.
- [226] arXiv:2406.13375 [pdf, html, other]
-
Title: ALiiCE: Evaluating Positional Fine-grained Citation GenerationSubjects: Computation and Language (cs.CL)
Large Language Models (LLMs) can enhance the credibility and verifiability by generating text with citations. However, existing tasks and evaluation methods are predominantly limited to sentence-level statement, neglecting the significance of positional fine-grained citations that can appear anywhere within sentences. To facilitate further exploration of the fine-grained citation generation, we propose ALiiCE, the first automatic evaluation framework for this task. Our framework first parses the sentence claim into atomic claims via dependency analysis and then calculates citation quality at the atomic claim level. ALiiCE introduces three novel metrics for positional fined-grained citation quality assessment, including positional fine-grained citation recall and precision, and coefficient of variation of citation positions. We evaluate the positional fine-grained citation generation performance of several LLMs on two long-form QA datasets. Our experiments and analyses demonstrate the effectiveness and reasonableness of ALiiCE. The results also indicate that existing LLMs still struggle to provide positional fine-grained citations.
- [227] arXiv:2406.13376 [pdf, html, other]
-
Title: Efficient Offline Reinforcement Learning: The Critic is CriticalSubjects: Machine Learning (cs.LG)
Recent work has demonstrated both benefits and limitations from using supervised approaches (without temporal-difference learning) for offline reinforcement learning. While off-policy reinforcement learning provides a promising approach for improving performance beyond supervised approaches, we observe that training is often inefficient and unstable due to temporal difference bootstrapping. In this paper we propose a best-of-both approach by first learning the behavior policy and critic with supervised learning, before improving with off-policy reinforcement learning. Specifically, we demonstrate improved efficiency by pre-training with a supervised Monte-Carlo value-error, making use of commonly neglected downstream information from the provided offline trajectories. We find that we are able to more than halve the training time of the considered offline algorithms on standard benchmarks, and surprisingly also achieve greater stability. We further build on the importance of having consistent policy and value functions to propose novel hybrid algorithms, TD3+BC+CQL and EDAC+BC, that regularize both the actor and the critic towards the behavior policy. This helps to more reliably improve on the behavior policy when learning from limited human demonstrations. Code is available at this https URL
- [228] arXiv:2406.13378 [pdf, html, other]
-
Title: Any360D: Towards 360 Depth Anything with Unlabeled 360 Data and M\"obius Spatial AugmentationSubjects: Computer Vision and Pattern Recognition (cs.CV)
Recently, Depth Anything Model (DAM) - a type of depth foundation model - reveals impressive zero-shot capacity for diverse perspective images. Despite its success, it remains an open question regarding DAM's performance on 360 images that enjoy a large field-of-view (180x360) but suffer from spherical distortions. To this end, we establish, to our knowledge, the first benchmark that aims to 1) evaluate the performance of DAM on 360 images and 2) develop a powerful 360 DAM for the benefit of the community. For this, we conduct a large suite of experiments that consider the key properties of 360 images, e.g., different 360 representations, various spatial transformations, and diverse indoor and outdoor scenes. This way, our benchmark unveils some key findings, e.g., DAM is less effective for diverse 360 scenes and sensitive to spatial transformations. To address these challenges, we first collect a large-scale unlabeled dataset including diverse indoor and outdoor scenes. We then propose a semi-supervised learning (SSL) framework to learn a 360 DAM, dubbed Any360D. Under the umbrella of SSL, Any360D first learns a teacher model by fine-tuning DAM via metric depth supervision. Then, we train the student model by uncovering the potential of large-scale unlabeled data with pseudo labels from the teacher model. Möbius transformation-based spatial augmentation (MTSA) is proposed to impose consistency regularization between the unlabeled data and spatially transformed ones. This subtly improves the student model's robustness to various spatial transformations even under severe distortions. Extensive experiments demonstrate that Any360D outperforms DAM and many prior data-specific models, e.g., PanoFormer, across diverse scenes, showing impressive zero-shot capacity for being a 360 depth foundation model.
- [229] arXiv:2406.13380 [pdf, html, other]
-
Title: D3: An Adaptive Reconfigurable Datacenter NetworkSubjects: Networking and Internet Architecture (cs.NI)
The explosively growing communication traffic in datacenters imposes increasingly stringent performance requirements on the underlying networks. Over the last years, researchers have developed innovative optical switching technologies that enable reconfigurable datacenter networks (RCDNs) which support very fast topology reconfigurations. This paper presents D3, a novel and feasible RDCN architecture that improves throughput and flow completion time. D3 quickly and jointly adapts its links and packet scheduling toward the evolving demand, combining both demand-oblivious and demand-aware behaviors when needed. D3 relies on a decentralized network control plane supporting greedy, integrated-multihop, IP-based routing, allowing to react, quickly and locally, to topological changes without overheads. A rack-local synchronization and transport layer further support fast network adjustments. Moreover, we argue that D3 can be implemented using the recently proposed Sirius architecture (SIGCOMM 2020). We report on an extensive empirical evaluation using packet-level simulations. We find that D3 improves throughput by up to 15% and preserves competitive flow completion times compared to the state of the art. We further provide an analytical explanation of the superiority of D3, introducing an extension of the well-known Birkhoff-von Neumann decomposition, which may be of independent interest.
- [230] arXiv:2406.13381 [pdf, html, other]
-
Title: CoAct: A Global-Local Hierarchy for Autonomous Agent CollaborationComments: 9 pages, 4 figuresSubjects: Computation and Language (cs.CL)
Existing LLMs exhibit remarkable performance on various NLP tasks, but still struggle with complex real-world tasks, even equipped with advanced strategies like CoT and ReAct. In this work, we propose the CoAct framework, which transfers the hierarchical planning and collaboration patterns in human society to LLM systems. Specifically, our CoAct framework involves two agents: (1) A global planning agent, to comprehend the problem scope, formulate macro-level plans and provide detailed sub-task descriptions to local execution agents, which serves as the initial rendition of a global plan. (2) A local execution agent, to operate within the multi-tier task execution structure, focusing on detailed execution and implementation of specific tasks within the global plan. Experimental results on the WebArena benchmark show that CoAct can re-arrange the process trajectory when facing failures, and achieves superior performance over baseline methods on long-horizon web tasks. Code is available at this https URL.
- [231] arXiv:2406.13384 [pdf, html, other]
-
Title: Straight Through Gumbel Softmax Estimator based Bimodal Neural Architecture Search for Audio-Visual Deepfake DetectionSubjects: Sound (cs.SD); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
Deepfakes are a major security risk for biometric authentication. This technology creates realistic fake videos that can impersonate real people, fooling systems that rely on facial features and voice patterns for identification. Existing multimodal deepfake detectors rely on conventional fusion methods, such as majority rule and ensemble voting, which often struggle to adapt to changing data characteristics and complex patterns. In this paper, we introduce the Straight-through Gumbel-Softmax (STGS) framework, offering a comprehensive approach to search multimodal fusion model architectures. Using a two-level search approach, the framework optimizes the network architecture, parameters, and performance. Initially, crucial features were efficiently identified from backbone networks, whereas within the cell structure, a weighted fusion operation integrated information from various sources. An architecture that maximizes the classification performance is derived by varying parameters such as temperature and sampling time. The experimental results on the FakeAVCeleb and SWAN-DF datasets demonstrated an impressive AUC value 94.4\% achieved with minimal model parameters.
- [232] arXiv:2406.13392 [pdf, html, other]
-
Title: Strengthening Layer Interaction via Dynamic Layer AttentionComments: Accepted by IJCAI2024Subjects: Computer Vision and Pattern Recognition (cs.CV)
In recent years, employing layer attention to enhance interaction among hierarchical layers has proven to be a significant advancement in building network structures. In this paper, we delve into the distinction between layer attention and the general attention mechanism, noting that existing layer attention methods achieve layer interaction on fixed feature maps in a static manner. These static layer attention methods limit the ability for context feature extraction among layers. To restore the dynamic context representation capability of the attention mechanism, we propose a Dynamic Layer Attention (DLA) architecture. The DLA comprises dual paths, where the forward path utilizes an improved recurrent neural network block, named Dynamic Sharing Unit (DSU), for context feature extraction. The backward path updates features using these shared context representations. Finally, the attention mechanism is applied to these dynamically refreshed feature maps among layers. Experimental results demonstrate the effectiveness of the proposed DLA architecture, outperforming other state-of-the-art methods in image recognition and object detection tasks. Additionally, the DSU block has been evaluated as an efficient plugin in the proposed DLA architecture.The code is available at this https URL.
- [233] arXiv:2406.13393 [pdf, html, other]
-
Title: Style-NeRF2NeRF: 3D Style Transfer From Style-Aligned Multi-View ImagesComments: 16 pages, 9 figuresSubjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
We propose a simple yet effective pipeline for stylizing a 3D scene, harnessing the power of 2D image diffusion models. Given a NeRF model reconstructed from a set of multi-view images, we perform 3D style transfer by refining the source NeRF model using stylized images generated by a style-aligned image-to-image diffusion model. Given a target style prompt, we first generate perceptually similar multi-view images by leveraging a depth-conditioned diffusion model with an attention-sharing mechanism. Next, based on the stylized multi-view images, we propose to guide the style transfer process with the sliced Wasserstein loss based on the feature maps extracted from a pre-trained CNN model. Our pipeline consists of decoupled steps, allowing users to test various prompt ideas and preview the stylized 3D result before proceeding to the NeRF fine-tuning stage. We demonstrate that our method can transfer diverse artistic styles to real-world 3D scenes with competitive quality.
- [234] arXiv:2406.13396 [pdf, other]
-
Title: Safe and Non-Conservative Trajectory Planning for Autonomous Driving Handling Unanticipated Behaviors of Traffic ParticipantsSubjects: Systems and Control (eess.SY)
Trajectory planning for autonomous driving is challenging because the unknown future motion of traffic participants must be accounted for, yielding large uncertainty. Stochastic Model Predictive Control (SMPC)-based planners provide non-conservative planning, but do not rule out a (small) probability of collision. We propose a control scheme that yields an efficient trajectory based on SMPC when the traffic scenario allows, still avoiding that the vehicle causes collisions with traffic participants if the latter move according to the prediction assumptions. If some traffic participant does not behave as anticipated, no safety guarantee can be given. Then, our approach yields a trajectory which minimizes the probability of collision, using Constraint Violation Probability Minimization techniques. Our algorithm can also be adapted to minimize the anticipated harm caused by a collision. We provide a thorough discussion of the benefits of our novel control scheme and compare it to a previous approach through numerical simulations from the CommonRoad database.
- [235] arXiv:2406.13397 [pdf, html, other]
-
Title: MoreHopQA: More Than Multi-hop ReasoningComments: 8 pages, 5 figures. First three authors contributed equallySubjects: Computation and Language (cs.CL)
Most existing multi-hop datasets are extractive answer datasets, where the answers to the questions can be extracted directly from the provided context. This often leads models to use heuristics or shortcuts instead of performing true multi-hop reasoning. In this paper, we propose a new multi-hop dataset, MoreHopQA, which shifts from extractive to generative answers. Our dataset is created by utilizing three existing multi-hop datasets: HotpotQA, 2WikiMultihopQA, and MuSiQue. Instead of relying solely on factual reasoning, we enhance the existing multi-hop questions by adding another layer of questioning that involves one, two, or all three of the following types of reasoning: commonsense, arithmetic, and symbolic. Our dataset is created through a semi-automated process, resulting in a dataset with 1,118 samples that have undergone human verification. We then use our dataset to evaluate five different large language models: Mistral 7B, Gemma 7B, Llama 3 (8B and 70B), and GPT-4. We also design various cases to analyze the reasoning steps in the question-answering process. Our results show that models perform well on initial multi-hop questions but struggle with our extended questions, indicating that our dataset is more challenging than previous ones. Our analysis of question decomposition reveals that although models can correctly answer questions, only a portion - 38.7% for GPT-4 and 33.4% for Llama3-70B - achieve perfect reasoning, where all corresponding sub-questions are answered correctly. Evaluation code and data are available at this https URL
- [236] arXiv:2406.13399 [pdf, html, other]
-
Title: VELO: A Vector Database-Assisted Cloud-Edge Collaborative LLM QoS Optimization FrameworkComments: to be published in IEEE ICWS 2024Subjects: Artificial Intelligence (cs.AI)
The Large Language Model (LLM) has gained significant popularity and is extensively utilized across various domains. Most LLM deployments occur within cloud data centers, where they encounter substantial response delays and incur high costs, thereby impacting the Quality of Services (QoS) at the network edge. Leveraging vector database caching to store LLM request results at the edge can substantially mitigate response delays and cost associated with similar requests, which has been overlooked by previous research. Addressing these gaps, this paper introduces a novel Vector database-assisted cloud-Edge collaborative LLM QoS Optimization (VELO) framework. Firstly, we propose the VELO framework, which ingeniously employs vector database to cache the results of some LLM requests at the edge to reduce the response time of subsequent similar requests. Diverging from direct optimization of the LLM, our VELO framework does not necessitate altering the internal structure of LLM and is broadly applicable to diverse LLMs. Subsequently, building upon the VELO framework, we formulate the QoS optimization problem as a Markov Decision Process (MDP) and devise an algorithm grounded in Multi-Agent Reinforcement Learning (MARL) to decide whether to request the LLM in the cloud or directly return the results from the vector database at the edge. Moreover, to enhance request feature extraction and expedite training, we refine the policy network of MARL and integrate expert demonstrations. Finally, we implement the proposed algorithm within a real edge system. Experimental findings confirm that our VELO framework substantially enhances user satisfaction by concurrently diminishing delay and resource consumption for edge users utilizing LLMs.
- [237] arXiv:2406.13404 [pdf, html, other]
-
Title: Low-Latency Layer-Aware Proactive and Passive Container Migration in Meta ComputingComments: to be published in IEEE ICMC 2024Subjects: Distributed, Parallel, and Cluster Computing (cs.DC)
Meta computing is a new computing paradigm that aims to efficiently utilize all network computing resources to provide fault-tolerant, personalized services with strong security and privacy guarantees. It also seeks to virtualize the Internet as many meta computers. In meta computing, tasks can be assigned to containers at edge nodes for processing, based on container images with multiple layers. The dynamic and resource-constrained nature of meta computing environments requires an optimal container migration strategy for mobile users to minimize latency. However, the problem of container migration in meta computing has not been thoroughly explored. To address this gap, we present low-latency, layer-aware container migration strategies that consider both proactive and passive migration. Specifically: 1) We formulate the container migration problem in meta computing, taking into account layer dependencies to reduce migration costs and overall task duration by considering four delays. 2) We introduce a reinforcement learning algorithm based on policy gradients to minimize total latency by identifying layer dependencies for action selection, making decisions for both proactive and passive migration. Expert demonstrations are introduced to enhance exploitation. 3) Experiments using real data trajectories show that the algorithm outperforms baseline algorithms, achieving lower total latency.
- [238] arXiv:2406.13408 [pdf, html, other]
-
Title: SQLFixAgent: Towards Semantic-Accurate SQL Generation via Multi-Agent CollaborationSubjects: Computation and Language (cs.CL)
While fine-tuned large language models (LLMs) excel in generating grammatically valid SQL in Text-to-SQL parsing, they often struggle to ensure semantic accuracy in queries, leading to user confusion and diminished system usability. To tackle this challenge, we introduce SQLFixAgent, an innovative multi-agent collaborative framework designed for detecting and repairing erroneous SQL. Our framework comprises a core agent, SQLRefiner, alongside two auxiliary agents: SQLReviewer and QueryCrafter. The SQLReviewer agent employs the rubber duck debugging method to identify potential semantic mismatches between SQL statement and user query. If the error is detected, the QueryCrafter agent generates multiple SQL statements as candidate repairs using a fine-tuned SQLTool. Subsequently, leveraging similar repair retrieval and failure memory reflexion, the SQLRefiner agent selects the most fitting SQL statement from the candidates as the final repair. We evaluated our proposed framework on five Text-to-SQL benchmarks. The experimental results show that our method consistently enhances the performance of the baseline model, specifically achieving an execution accuracy improvement of over 3\% on the Bird benchmark. Our framework also has a higher token efficiency compared to other advanced methods, making it more competitive.
- [239] arXiv:2406.13409 [pdf, html, other]
-
Title: PetalView: Fine-grained Location and Orientation Extraction of Street-view Images via Cross-view Local Search with Supplementary MaterialsWenmiao Hu, Yichen Zhang, Yuxuan Liang, Xianjing Han, Yifang Yin, Hannes Kruppa, See-Kiong Ng, Roger ZimmermannComments: This paper has been accepted by ACM Multimedia 2023. This version contains additional supplementary materialsJournal-ref: Proceedings of the 31st ACM International Conference on Multimedia (2023) 56-66Subjects: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Satellite-based street-view information extraction by cross-view matching refers to a task that extracts the location and orientation information of a given street-view image query by using one or multiple geo-referenced satellite images. Recent work has initiated a new research direction to find accurate information within a local area covered by one satellite image centered at a location prior (e.g., from GPS). It can be used as a standalone solution or complementary step following a large-scale search with multiple satellite candidates. However, these existing works require an accurate initial orientation (angle) prior (e.g., from IMU) and/or do not efficiently search through all possible poses. To allow efficient search and to give accurate prediction regardless of the existence or the accuracy of the angle prior, we present PetalView extractors with multi-scale search. The PetalView extractors give semantically meaningful features that are equivalent across two drastically different views, and the multi-scale search strategy efficiently inspects the satellite image from coarse to fine granularity to provide sub-meter and sub-degree precision extraction. Moreover, when an angle prior is given, we propose a learnable prior angle mixer to utilize this information. Our method obtains the best performance on the VIGOR dataset and successfully improves the performance on KITTI dataset test 1 set with the recall within 1 meter (r@1m) for location estimation to 68.88% and recall within 1 degree (r@1d) 21.10% when no angle prior is available, and with angle prior achieves stable estimations at r@1m and r@1d above 70% and 21%, up to a 40-degree noise level.
- [240] arXiv:2406.13411 [pdf, html, other]
-
Title: Composite Concept Extraction through BackdooringSubjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Learning composite concepts, such as \textquotedbl red car\textquotedbl , from individual examples -- like a white car representing the concept of \textquotedbl car\textquotedbl{} and a red strawberry representing the concept of \textquotedbl red\textquotedbl -- is inherently challenging. This paper introduces a novel method called Composite Concept Extractor (CoCE), which leverages techniques from traditional backdoor attacks to learn these composite concepts in a zero-shot setting, requiring only examples of individual concepts. By repurposing the trigger-based model backdooring mechanism, we create a strategic distortion in the manifold of the target object (e.g., \textquotedbl car\textquotedbl ) induced by example objects with the target property (e.g., \textquotedbl red\textquotedbl ) from objects \textquotedbl red strawberry\textquotedbl , ensuring the distortion selectively affects the target objects with the target property. Contrastive learning is then employed to further refine this distortion, and a method is formulated for detecting objects that are influenced by the distortion. Extensive experiments with in-depth analysis across different datasets demonstrate the utility and applicability of our proposed approach.
- [241] arXiv:2406.13414 [pdf, html, other]
-
Title: Archive-based Single-Objective Evolutionary Algorithms for Submodular OptimizationComments: To appear at PPSN 2024Subjects: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI)
Constrained submodular optimization problems play a key role in the area of combinatorial optimization as they capture many NP-hard optimization problems. So far, Pareto optimization approaches using multi-objective formulations have been shown to be successful to tackle these problems while single-objective formulations lead to difficulties for algorithms such as the $(1+1)$-EA due to the presence of local optima. We introduce for the first time single-objective algorithms that are provably successful for different classes of constrained submodular maximization problems. Our algorithms are variants of the $(1+\lambda)$-EA and $(1+1)$-EA and increase the feasible region of the search space incrementally in order to deal with the considered submodular problems.
- [242] arXiv:2406.13415 [pdf, html, other]
-
Title: Factual Confidence of LLMs: on Reliability and Robustness of Current EstimatorsComments: accepted on the main track of ACL 2024Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Large Language Models (LLMs) tend to be unreliable in the factuality of their answers. To address this problem, NLP researchers have proposed a range of techniques to estimate LLM's confidence over facts. However, due to the lack of a systematic comparison, it is not clear how the different methods compare to one another. To fill this gap, we present a survey and empirical comparison of estimators of factual confidence. We define an experimental framework allowing for fair comparison, covering both fact-verification and question answering. Our experiments across a series of LLMs indicate that trained hidden-state probes provide the most reliable confidence estimates, albeit at the expense of requiring access to weights and training data. We also conduct a deeper assessment of factual confidence by measuring the consistency of model behavior under meaning-preserving variations in the input. We find that the confidence of LLMs is often unstable across semantically equivalent inputs, suggesting that there is much room for improvement of the stability of models' parametric knowledge. Our code is available at (this https URL).
- [243] arXiv:2406.13419 [pdf, html, other]
-
Title: An eight-neuron network for quadruped locomotion with hip-knee joint controlSubjects: Robotics (cs.RO)
The gait generator, which is capable of producing rhythmic signals for coordinating multiple joints, is an essential component in the quadruped robot locomotion control framework. The biological counterpart of the gait generator is the Central Pattern Generator (abbreviated as CPG), a small neural network consisting of interacting neurons. Inspired by this architecture, researchers have designed artificial neural networks composed of simulated neurons or oscillator equations. Despite the widespread application of these designed CPGs in various robot locomotion controls, some issues remain unaddressed, including: (1) Simplistic network designs often overlook the symmetry between signal and network structure, resulting in fewer gait patterns than those found in nature. (2) Due to minimal architectural consideration, quadruped control CPGs typically consist of only four neurons, which restricts the network's direct control to leg phases rather than joint coordination. (3) Gait changes are achieved by varying the neuron couplings or the assignment between neurons and legs, rather than through external stimulation. We apply symmetry theory to design an eight-neuron network, composed of Stein neuronal models, capable of achieving five gaits and coordinated control of the hip-knee joints. We validate the signal stability of this network as a gait generator through numerical simulations, which reveal various results and patterns encountered during gait transitions using neuronal stimulation. Based on these findings, we have developed several successful gait transition strategies through neuronal stimulations. Using a commercial quadruped robot model, we demonstrate the usability and feasibility of this network by implementing motion control and gait transitions.
- [244] arXiv:2406.13420 [pdf, html, other]
-
Title: The effect of control barrier functions on energy transfers in controlled physical systemsSubjects: Systems and Control (eess.SY)
Using a port-Hamiltonian formalism, we show the qualitative and quantitative effect of safety-critical control implemented with control barrier functions (CBFs) on the power balance of controlled physical systems. The presented results will provide novel tools to design CBFs inducing desired energetic behaviors of the closed-loop system, including nontrivial damping injection effects and non-passive control actions, effectively injecting energy in the system in a controlled manner. Simulations validate the stated results.
- [245] arXiv:2406.13424 [pdf, html, other]
-
Title: Towards a multimodal framework for remote sensing image change retrieval and captioningSubjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Recently, there has been increasing interest in multimodal applications that integrate text with other modalities, such as images, audio and video, to facilitate natural language interactions with multimodal AI systems. While applications involving standard modalities have been extensively explored, there is still a lack of investigation into specific data modalities such as remote sensing (RS) data. Despite the numerous potential applications of RS data, including environmental protection, disaster monitoring and land planning, available solutions are predominantly focused on specific tasks like classification, captioning and retrieval. These solutions often overlook the unique characteristics of RS data, such as its capability to systematically provide information on the same geographical areas over time. This ability enables continuous monitoring of changes in the underlying landscape. To address this gap, we propose a novel foundation model for bi-temporal RS image pairs, in the context of change detection analysis, leveraging Contrastive Learning and the LEVIR-CC dataset for both captioning and text-image retrieval. By jointly training a contrastive encoder and captioning decoder, our model add text-image retrieval capabilities, in the context of bi-temporal change detection, while maintaining captioning performances that are comparable to the state of the art. We release the source code and pretrained weights at: this https URL.
- [246] arXiv:2406.13427 [pdf, other]
-
Title: Are Logistic Models Really Interpretable?Comments: 36 pages, 5 Figures. Extended version of paper accepted to IJCAI 2024. arXiv admin note: substantial text overlap with arXiv:2211.06360Subjects: Machine Learning (cs.LG)
The demand for open and trustworthy AI models points towards widespread publishing of model weights. Consumers of these model weights must be able to act accordingly with the information provided. That said, one of the simplest AI classification models, Logistic Regression (LR), has an unwieldy interpretation of its model weights, with greater difficulties when extending LR to generalised additive models. In this work, we show via a User Study that skilled participants are unable to reliably reproduce the action of small LR models given the trained parameters. As an antidote to this, we define Linearised Additive Models (LAMs), an optimal piecewise linear approximation that augments any trained additive model equipped with a sigmoid link function, requiring no retraining. We argue that LAMs are more interpretable than logistic models -- survey participants are shown to solve model reasoning tasks with LAMs much more accurately than with LR given the same information. Furthermore, we show that LAMs do not suffer from large performance penalties in terms of ROC-AUC and calibration with respect to their logistic counterparts on a broad suite of public financial modelling data.
- [247] arXiv:2406.13431 [pdf, html, other]
-
Title: Children's Speech Recognition through Discrete Token EnhancementComments: Accepted at Interspeech 2024Subjects: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Children's speech recognition is considered a low-resource task mainly due to the lack of publicly available data. There are several reasons for such data scarcity, including expensive data collection and annotation processes, and data privacy, among others. Transforming speech signals into discrete tokens that do not carry sensitive information but capture both linguistic and acoustic information could be a solution for privacy concerns. In this study, we investigate the integration of discrete speech tokens into children's speech recognition systems as input without significantly degrading the ASR performance. Additionally, we explored single-view and multi-view strategies for creating these discrete labels. Furthermore, we tested the models for generalization capabilities with unseen domain and nativity dataset. Results reveal that the discrete token ASR for children achieves nearly equivalent performance with an approximate 83% reduction in parameters.
- [248] arXiv:2406.13433 [pdf, html, other]
-
Title: Certificates of Differential Privacy and Unlearning for Gradient-Based TrainingComments: 15 pages, 14 figuresSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Proper data stewardship requires that model owners protect the privacy of individuals' data used during training. Whether through anonymization with differential privacy or the use of unlearning in non-anonymized settings, the gold-standard techniques for providing privacy guarantees can come with significant performance penalties or be too weak to provide practical assurances. In part, this is due to the fact that the guarantee provided by differential privacy represents the worst-case privacy leakage for any individual, while the true privacy leakage of releasing the prediction for a given individual might be substantially smaller or even, as we show, non-existent. This work provides a novel framework based on convex relaxations and bounds propagation that can compute formal guarantees (certificates) that releasing specific predictions satisfies $\epsilon=0$ privacy guarantees or do not depend on data that is subject to an unlearning request. Our framework offers a new verification-centric approach to privacy and unlearning guarantees, that can be used to further engender user trust with tighter privacy guarantees, provide formal proofs of robustness to certain membership inference attacks, identify potentially vulnerable records, and enhance current unlearning approaches. We validate the effectiveness of our approach on tasks from financial services, medical imaging, and natural language processing.
- [249] arXiv:2406.13434 [pdf, html, other]
-
Title: Tactile Aware Dynamic Obstacle Avoidance in Crowded Environment with Deep Reinforcement LearningSubjects: Robotics (cs.RO)
Mobile robots operating in crowded environments require the ability to navigate among humans and surrounding obstacles efficiently while adhering to safety standards and socially compliant mannerisms. This scale of the robot navigation problem may be classified as both a local path planning and trajectory optimization problem. This work presents an array of force sensors that act as a tactile layer to complement the use of a LiDAR for the purpose of inducing awareness of contact with any surrounding objects within immediate vicinity of a mobile robot undetected by LiDARs. By incorporating the tactile layer, the robot can take more risks in its movements and possibly go right up to an obstacle or wall, and gently squeeze past it. In addition, we built up a simulation platform via Pybullet which integrates Robot Operating System (ROS) and reinforcement learning (RL) together. A touch-aware neural network model was trained on it to create an RL-based local path planner for dynamic obstacle avoidance. Our proposed method was demonstrated successfully on an omni-directional mobile robot who was able to navigate in a crowded environment with high agility and versatility in movement, while not being overly sensitive to nearby obstacles-not-in-contact.
- [250] arXiv:2406.13436 [pdf, html, other]
-
Title: What's Next? Exploring Utilization, Challenges, and Future Directions of AI-Generated Image Tools in Graphic DesignSubjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Recent advancements in artificial intelligence, such as computer vision and deep learning, have led to the emergence of numerous generative AI platforms, particularly for image generation. However, the application of AI-generated image tools in graphic design has not been extensively explored. This study conducted semi-structured interviews with seven designers of varying experience levels to understand their current usage, challenges, and future functional needs for AI-generated image tools in graphic design. As our findings suggest, AI tools serve as creative partners in design, enhancing human creativity, offering strategic insights, and fostering team collaboration and communication. The findings provide guiding recommendations for the future development of AI-generated image tools, aimed at helping engineers optimize these tools to better meet the needs of graphic designers.
- [251] arXiv:2406.13437 [pdf, html, other]
-
Title: MsFEM for advection-dominated problems in heterogeneous media: Stabilization via nonconforming variantsSubjects: Numerical Analysis (math.NA)
We study the numerical approximation of advection-diffusion equations with highly oscillatory coefficients and possibly dominant advection terms by means of the Multiscale Finite Element Method. The latter method is a now classical, finite element type method that performs a Galerkin approximation on a problem-dependent basis set, itself pre-computed in an offline stage. The approach is implemented here using basis functions that locally resolve both the diffusion and the advection terms. Variants with additional bubble functions and possibly weak inter-element continuity are proposed. Some theoretical arguments and a comprehensive set of numerical experiments allow to investigate and compare the stability and the accuracy of the approaches. The best approach constructed is shown to be adequate for both the diffusion- and advection-dominated regimes, and does not rely on an auxiliary stabilization parameter that would have to be properly adjusted.
- [252] arXiv:2406.13439 [pdf, html, other]
-
Title: Finding Blind Spots in Evaluator LLMs with Interpretable ChecklistsSubjects: Computation and Language (cs.CL)
Large Language Models (LLMs) are increasingly relied upon to evaluate text outputs of other LLMs, thereby influencing leaderboards and development decisions. However, concerns persist over the accuracy of these assessments and the potential for misleading conclusions. In this work, we investigate the effectiveness of LLMs as evaluators for text generation tasks. We propose FBI, a novel framework designed to examine the proficiency of Evaluator LLMs in assessing four critical abilities in other LLMs: factual accuracy, instruction following, coherence in long-form writing, and reasoning proficiency. By introducing targeted perturbations in answers generated by LLMs, that clearly impact one of these key capabilities, we test whether an Evaluator LLM can detect these quality drops. By creating a total of 2400 perturbed answers covering 22 perturbation categories, we conduct a comprehensive study using different evaluation strategies on five prominent LLMs commonly used as evaluators in the literature. Our findings reveal significant shortcomings in current Evaluator LLMs, which failed to identify quality drops in over 50\% of cases on average. Single-answer and pairwise evaluations demonstrated notable limitations, whereas reference-based evaluations showed comparatively better performance. These results underscore the unreliable nature of current Evaluator LLMs and advocate for cautious implementation in practical applications. Code and data are available at this https URL.
- [253] arXiv:2406.13443 [pdf, html, other]
-
Title: Dual-Phase Accelerated Prompt OptimizationSubjects: Computation and Language (cs.CL)
Gradient-free prompt optimization methods have made significant strides in enhancing the performance of closed-source Large Language Models (LLMs) across a wide range of tasks. However, existing approaches make light of the importance of high-quality prompt initialization and the identification of effective optimization directions, thus resulting in substantial optimization steps to obtain satisfactory performance. In this light, we aim to accelerate prompt optimization process to tackle the challenge of low convergence rate. We propose a dual-phase approach which starts with generating high-quality initial prompts by adopting a well-designed meta-instruction to delve into task-specific information, and iteratively optimize the prompts at the sentence level, leveraging previous tuning experience to expand prompt candidates and accept effective ones. Extensive experiments on eight datasets demonstrate the effectiveness of our proposed method, achieving a consistent accuracy gain over baselines with less than five optimization steps.
- [254] arXiv:2406.13444 [pdf, html, other]
-
Title: VDebugger: Harnessing Execution Feedback for Debugging Visual ProgramsSubjects: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Visual programs are executable code generated by large language models to address visual reasoning problems. They decompose complex questions into multiple reasoning steps and invoke specialized models for each step to solve the problems. However, these programs are prone to logic errors, with our preliminary evaluation showing that 58% of the total errors are caused by program logic errors. Debugging complex visual programs remains a major bottleneck for visual reasoning. To address this, we introduce VDebugger, a novel critic-refiner framework trained to localize and debug visual programs by tracking execution step by step. VDebugger identifies and corrects program errors leveraging detailed execution feedback, improving interpretability and accuracy. The training data is generated through an automated pipeline that injects errors into correct visual programs using a novel mask-best decoding technique. Evaluations on six datasets demonstrate VDebugger's effectiveness, showing performance improvements of up to 3.2% in downstream task accuracy. Further studies show VDebugger's ability to generalize to unseen tasks, bringing a notable improvement of 2.3% on the unseen COVR task. Code, data and models are made publicly available at this https URL
- [255] arXiv:2406.13445 [pdf, html, other]
-
Title: Lost in UNet: Improving Infrared Small Target Detection by Underappreciated Local FeaturesSubjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Many targets are often very small in infrared images due to the long-distance imaging meachnism. UNet and its variants, as popular detection backbone networks, downsample the local features early and cause the irreversible loss of these local features, leading to both the missed and false detection of small targets in infrared images. We propose HintU, a novel network to recover the local features lost by various UNet-based methods for effective infrared small target detection. HintU has two key contributions. First, it introduces the "Hint" mechanism for the first time, i.e., leveraging the prior knowledge of target locations to highlight critical local features. Second, it improves the mainstream UNet-based architecture to preserve target pixels even after downsampling. HintU can shift the focus of various networks (e.g., vanilla UNet, UNet++, UIUNet, MiM+, and HCFNet) from the irrelevant background pixels to a more restricted area from the beginning. Experimental results on three datasets NUDT-SIRST, SIRSTv2 and IRSTD1K demonstrate that HintU enhances the performance of existing methods with only an additional 1.88 ms cost (on RTX Titan). Additionally, the explicit constraints of HintU enhance the generalization ability of UNet-based methods. Code is available at this https URL.
- [256] arXiv:2406.13449 [pdf, html, other]
-
Title: Consensus analysis of a two-step communication opinion dynamics model with group pressure and self-confidenceComments: 7 pages, 15 figuresSubjects: Information Theory (cs.IT)
This paper considers the consensus problem of a novel opinion dynamics model with group pressure and self-confidence. Different with the most existing paper, the influence of friends of friends in a social network is taken into account, which is modeled to be two-step communication. Based on this consideration, the neighbors of agents are classified into direct neighbors and indirect neighbors. Accordingly, the communication between agents and their neighbors is classified into one-step communication and two-step communication. By applying matrix analytic theory and graph theory, it is shown that the opinion consensus can be achieved. Moreover, the exactly consensus value of the opinion is obtained for three cases of the group pressure. Finally, simulation examples are provided to demonstrate the validity of the conclusions drawn in the paper.
- [257] arXiv:2406.13450 [pdf, html, other]
-
Title: Federating to Grow Transformers with Constrained Resources without Model SharingSubjects: Artificial Intelligence (cs.AI)
The high resource consumption of large-scale models discourages resource-constrained users from developing their customized transformers. To this end, this paper considers a federated framework named Fed-Grow for multiple participants to cooperatively scale a transformer from their pre-trained small models. Under the Fed-Grow, a Dual-LiGO (Dual Linear Growth Operator) architecture is designed to help participants expand their pre-trained small models to a transformer. In Dual-LiGO, the Local-LiGO part is used to address the heterogeneity problem caused by the various pre-trained models, and the Global-LiGO part is shared to exchange the implicit knowledge from the pre-trained models, local data, and training process of participants. Instead of model sharing, only sharing the Global-LiGO strengthens the privacy of our approach. Compared with several state-of-the-art methods in simulation, our approach has higher accuracy, better precision, and lower resource consumption on computations and communications. To the best of our knowledge, most of the previous model-scaling works are centralized, and our work is the first one that cooperatively grows a transformer from multiple pre-trained heterogeneous models with the user privacy protected in terms of local data and models. We hope that our approach can extend the transformers to the broadly distributed scenarios and encourage more resource-constrained users to enjoy the bonus taken by the large-scale transformers.
- [258] arXiv:2406.13453 [pdf, html, other]
-
Title: Reinforcement Learning to improve delta robot throws for sorting scrap metalSubjects: Robotics (cs.RO)
This study proposes a novel approach based on reinforcement learning (RL) to enhance the sorting efficiency of scrap metal using delta robots and a Pick-and-Place (PaP) process, widely used in the industry. We use three classical model-free RL algorithms (TD3, SAC and PPO) to reduce the time to sort metal scraps. We learn the release position and speed needed to throw an object in a bin instead of moving to the exact bin location, as with the classical PaP technique. Our contribution is threefold. First, we provide a new simulation environment for learning RL-based Pick-and-Throw (PaT) strategies for parallel grippers. Second, we use RL algorithms for learning this task in this environment resulting in 89% accuracy while speeding up the throughput by 51% in simulation. Third, we evaluate the performances of RL algorithms and compare them to a PaP and a state-of-the-art PaT method both in simulation and reality, learning only from simulation with domain randomisation and without fine tuning in reality to transfer our policies. This work shows the benefits of RL-based PaT compared to PaP or classical optimization PaT techniques used in the industry.
- [259] arXiv:2406.13457 [pdf, html, other]
-
Title: EvTexture: Event-driven Texture Enhancement for Video Super-ResolutionComments: ICML 2024. Project page: this https URLSubjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Event-based vision has drawn increasing attention due to its unique characteristics, such as high temporal resolution and high dynamic range. It has been used in video super-resolution (VSR) recently to enhance the flow estimation and temporal alignment. Rather than for motion learning, we propose in this paper the first VSR method that utilizes event signals for texture enhancement. Our method, called EvTexture, leverages high-frequency details of events to better recover texture regions in VSR. In our EvTexture, a new texture enhancement branch is presented. We further introduce an iterative texture enhancement module to progressively explore the high-temporal-resolution event information for texture restoration. This allows for gradual refinement of texture regions across multiple iterations, leading to more accurate and rich high-resolution details. Experimental results show that our EvTexture achieves state-of-the-art performance on four datasets. For the Vid4 dataset with rich textures, our method can get up to 4.67dB gain compared with recent event-based methods. Code: this https URL.
- [260] arXiv:2406.13462 [pdf, other]
-
Title: Design of Phase Locked Loop in 180 nm TechnologySubjects: Systems and Control (eess.SY)
The presented paper introduces a design for a phase-locked loop (PLL) that is utilized in frequency synthesis and modulation-demodulation within communication systems and in VLSI applications. The CMOS PLL is designed using 180 nm Fabrication Technology on Cadence Virtuoso Tool with a supply voltage of 1.8 V. The performance is evaluated through simulations and measurements, which demonstrate its ability to track and lock onto the input frequency. The PLL is a frequency synthesizer implemented to generate 2.4 GHz frequency. The input reference clock from a crystal oscillator is 150 MHz square wave. Negative feedback is given by divide-by-16 frequency divider, ensuring the phase and frequency synchronization between the divided signal and the reference signal. The design has essential components such as a phase frequency detector, charge pump, loop filter, current-starved voltage-controlled oscillator (CSVCO), and frequency divider. Through their collaborative operation, the system generates an output frequency that is 16 times the input frequency. The centre frequency of the 3-stage CSVCO is 3.208 GHz at 900 mV input voltage. With an input voltage ranging from 0.4 V to 1.8 V, the VCO offers a tuning range that spans from 1.066 GHz to 3.731 GHz.PLL demonstrates a lock-in range spanning from 70.4 MHz to 173 MHz, with an output frequency range of 1.12 GHz to 2.78 GHz. It achieves a lock time of 260.03 ns and consumes a maximum power of 5.15 mW at 2.4 GHz.
- [261] arXiv:2406.13466 [pdf, other]
-
Title: Unveiling Covert Semantics: Joint Source-Channel Coding Under a Covertness ConstraintComments: Submitted to GLOBECOM'2024 for reviewSubjects: Information Theory (cs.IT)
The fundamental limit of Semantic Communications (joint source-channel coding) is established when the transmission needs to be kept covert from an external warden. We derive information-theoretic achievability and matching converse results and we show that source and channel coding separation holds for this setup. Furthermore, we show through an experimental setup that one can train a deep neural network to achieve covert semantic communication for the classification task. Our numerical experiments confirm our theoretical findings, which indicate that for reliable joint source-channel coding the number of transmitted source symbols can only scale as the square-root of the number of channel uses.
- [262] arXiv:2406.13469 [pdf, html, other]
-
Title: Encoder vs Decoder: Comparative Analysis of Encoder and Decoder Language Models on Multilingual NLU TasksComments: 14 pages, 2 figuresSubjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
This paper explores the performance of encoder and decoder language models on multilingual Natural Language Understanding (NLU) tasks, with a broad focus on Germanic languages. Building upon the ScandEval benchmark, which initially was restricted to evaluating encoder models, we extend the evaluation framework to include decoder models. We introduce a method for evaluating decoder models on NLU tasks and apply it to the languages Danish, Swedish, Norwegian, Icelandic, Faroese, German, Dutch, and English. Through a series of experiments and analyses, we address key research questions regarding the comparative performance of encoder and decoder models, the impact of NLU task types, and the variation across language resources. Our findings reveal that decoder models can achieve significantly better NLU performance than encoder models, with nuances observed across different tasks and languages. Additionally, we investigate the correlation between decoders and task performance via a UMAP analysis, shedding light on the unique capabilities of decoder and encoder models. This study contributes to a deeper understanding of language model paradigms in NLU tasks and provides valuable insights for model selection and evaluation in multilingual settings.
- [263] arXiv:2406.13473 [pdf, html, other]
-
Title: Snowy Scenes,Clear Detections: A Robust Model for Traffic Light Detection in Adverse Weather ConditionsSubjects: Computer Vision and Pattern Recognition (cs.CV)
With the rise of autonomous vehicles and advanced driver-assistance systems (ADAS), ensuring reliable object detection in all weather conditions is crucial for safety and efficiency. Adverse weather like snow, rain, and fog presents major challenges for current detection systems, often resulting in failures and potential safety risks. This paper introduces a novel framework and pipeline designed to improve object detection under such conditions, focusing on traffic signal detection where traditional methods often fail due to domain shifts caused by adverse weather. We provide a comprehensive analysis of the limitations of existing techniques. Our proposed pipeline significantly enhances detection accuracy in snow, rain, and fog. Results show a 40.8% improvement in average IoU and F1 scores compared to naive fine-tuning and a 22.4% performance increase in domain shift scenarios, such as training on artificial snow and testing on rain images.
- [264] arXiv:2406.13474 [pdf, html, other]
-
Title: Attention-aware Post-training Quantization without BackpropagationComments: 20 pages, under reviewSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Quantization is a promising solution for deploying large-scale language models (LLMs) on resource-constrained devices. Existing quantization approaches, however, rely on gradient-based optimization, regardless of it being post-training quantization (PTQ) or quantization-aware training (QAT), which becomes problematic for hyper-scale LLMs with billions of parameters. This overhead can be alleviated via recently proposed backpropagation-free PTQ methods; however, their performance is somewhat limited by their lack of consideration of inter-layer dependencies. In this paper, we thus propose a novel PTQ algorithm that considers inter-layer dependencies without relying on backpropagation. The fundamental concept involved is the development of attention-aware Hessian matrices, which facilitates the consideration of inter-layer dependencies within the attention module. Extensive experiments demonstrate that the proposed algorithm significantly outperforms conventional PTQ methods, particularly for low bit-widths.
- [265] arXiv:2406.13476 [pdf, html, other]
-
Title: LLMs Are Zero-Shot Context-Aware Simultaneous TranslatorsSubjects: Computation and Language (cs.CL)
The advent of transformers has fueled progress in machine translation. More recently large language models (LLMs) have come to the spotlight thanks to their generality and strong performance in a wide range of language tasks, including translation. Here we show that open-source LLMs perform on par with or better than some state-of-the-art baselines in simultaneous machine translation (SiMT) tasks, zero-shot. We also demonstrate that injection of minimal background information, which is easy with an LLM, brings further performance gains, especially on challenging technical subject-matter. This highlights LLMs' potential for building next generation of massively multilingual, context-aware and terminologically accurate SiMT systems that require no resource-intensive training or fine-tuning.
- [266] arXiv:2406.13477 [pdf, other]
-
Title: An extension of the low-rank Lyapunov ADI to non-zero initial values and its applicationsSubjects: Numerical Analysis (math.NA)
We derive the Alternating-Direction Implicit (ADI) method based on a commuting operator split and apply the results to the continuous time algebraic Lyapunov equation with low-rank constant term and approximate solution. Previously, it has been mandatory to start the low-rank ADI (LR-ADI) with an all-zero initial value. Our approach retains the known efficient iteration schemes of low-rank increments and residual to arbitrary low-rank initial values for the LR-ADI method. We further generalize some of the known properties of the LR-ADI for Lyapunov equations to larger classes of algorithms or problems.
We investigate the performance of arbitrary initial values using two outer iterations in which LR-ADI is typically called. First, we solve an algebraic Riccati equation with the Newton method. Second, we solve a differential Riccati equation with a first-order Rosenbrock method. Numerical experiments confirm that the proposed new initial value of the alternating-directions implicit (ADI) can lead to a significant reduction in the total number of ADI steps, while also showing a 17% and 8x speed-up over the zero initial value for the two equation types, respectively. - [267] arXiv:2406.13482 [pdf, html, other]
-
Title: Estimating Map Completeness in Robot ExplorationComments: Under review at IEEE RALSubjects: Robotics (cs.RO)
In this paper, we propose a method that, given a partial grid map of an indoor environment built by an autonomous mobile robot, estimates the amount of the explored area represented in the map, as well as whether the uncovered part is still worth being explored or not. Our method is based on a deep convolutional neural network trained on data from partially explored environments with annotations derived from the knowledge of the entire map (which is not available when the network is used for inference). We show how such a network can be used to define a stopping criterion to terminate the exploration process when it is no longer adding relevant details about the environment to the map, saving, on average, 40% of the total exploration time with respect to covering all the area of the environment.
- [268] arXiv:2406.13487 [pdf, html, other]
-
Title: An evidential time-to-event prediction model based on Gaussian random fuzzy numbersJournal-ref: BELIEF2024Subjects: Machine Learning (cs.LG)
We introduce an evidential model for time-to-event prediction with censored data. In this model, uncertainty on event time is quantified by Gaussian random fuzzy numbers, a newly introduced family of random fuzzy subsets of the real line with associated belief functions, generalizing both Gaussian random variables and Gaussian possibility distributions. Our approach makes minimal assumptions about the underlying time-to-event distribution. The model is fit by minimizing a generalized negative log-likelihood function that accounts for both normal and censored data. Comparative experiments on two real-world datasets demonstrate the very good performance of our model as compared to the state-of-the-art.
- [269] arXiv:2406.13490 [pdf, html, other]
-
Title: The Surprising Benefits of Base Rate Neglect in Robust AggregationSubjects: Machine Learning (cs.LG); Computer Science and Game Theory (cs.GT)
Robust aggregation integrates predictions from multiple experts without knowledge of the experts' information structures. Prior work assumes experts are Bayesian, providing predictions as perfect posteriors based on their signals. However, real-world experts often deviate systematically from Bayesian reasoning. Our work considers experts who tend to ignore the base rate. We find that a certain degree of base rate neglect helps with robust forecast aggregation.
Specifically, we consider a forecast aggregation problem with two experts who each predict a binary world state after observing private signals. Unlike previous work, we model experts exhibiting base rate neglect, where they incorporate the base rate information to degree $\lambda\in[0,1]$, with $\lambda=0$ indicating complete ignorance and $\lambda=1$ perfect Bayesian updating. To evaluate aggregators' performance, we adopt Arieli et al. (2018)'s worst-case regret model, which measures the maximum regret across the set of considered information structures compared to an omniscient benchmark. Our results reveal the surprising V-shape of regret as a function of $\lambda$. That is, predictions with an intermediate incorporating degree of base rate $\lambda<1$ can counter-intuitively lead to lower regret than perfect Bayesian posteriors with $\lambda=1$. We additionally propose a new aggregator with low regret robust to unknown $\lambda$. Finally, we conduct an empirical study to test the base rate neglect model and evaluate the performance of various aggregators. - [270] arXiv:2406.13493 [pdf, other]
-
Title: In-Context In-Context Learning with Transformer Neural ProcessesSubjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Neural processes (NPs) are a powerful family of meta-learning models that seek to approximate the posterior predictive map of the ground-truth stochastic process from which each dataset in a meta-dataset is sampled. There are many cases in which practitioners, besides having access to the dataset of interest, may also have access to other datasets that share similarities with it. In this case, integrating these datasets into the NP can improve predictions. We equip NPs with this functionality and describe this paradigm as in-context in-context learning. Standard NP architectures, such as the convolutional conditional NP (ConvCNP) or the family of transformer neural processes (TNPs), are not capable of in-context in-context learning, as they are only able to condition on a single dataset. We address this shortcoming by developing the in-context in-context learning pseudo-token TNP (ICICL-TNP). The ICICL-TNP builds on the family of PT-TNPs, which utilise pseudo-token-based transformer architectures to sidestep the quadratic computational complexity associated with regular transformer architectures. Importantly, the ICICL-TNP is capable of conditioning on both sets of datapoints and sets of datasets, enabling it to perform in-context in-context learning. We demonstrate the importance of in-context in-context learning and the effectiveness of the ICICL-TNP in a number of experiments.
- [271] arXiv:2406.13495 [pdf, html, other]
-
Title: DF40: Toward Next-Generation Deepfake DetectionZhiyuan Yan, Taiping Yao, Shen Chen, Yandan Zhao, Xinghe Fu, Junwei Zhu, Donghao Luo, Li Yuan, Chengjie Wang, Shouhong Ding, Yunsheng WuSubjects: Computer Vision and Pattern Recognition (cs.CV)
We propose a new comprehensive benchmark to revolutionize the current deepfake detection field to the next generation. Predominantly, existing works identify top-notch detection algorithms and models by adhering to the common practice: training detectors on one specific dataset (e.g., FF++) and testing them on other prevalent deepfake datasets. This protocol is often regarded as a "golden compass" for navigating SoTA detectors. But can these stand-out "winners" be truly applied to tackle the myriad of realistic and diverse deepfakes lurking in the real world? If not, what underlying factors contribute to this gap? In this work, we found the dataset (both train and test) can be the "primary culprit" due to: (1) forgery diversity: Deepfake techniques are commonly referred to as both face forgery (face-swapping and face-reenactment) and entire image synthesis (AIGC). Most existing datasets only contain partial types, with limited forgery methods implemented; (2) forgery realism: The dominant training dataset, FF++, contains old forgery techniques from the past five years. "Honing skills" on these forgeries makes it difficult to guarantee effective detection of nowadays' SoTA deepfakes; (3) evaluation protocol: Most detection works perform evaluations on one type, e.g., train and test on face-swapping only, which hinders the development of universal deepfake detectors. To address this dilemma, we construct a highly diverse and large-scale deepfake dataset called DF40, which comprises 40 distinct deepfake techniques. We then conduct comprehensive evaluations using 4 standard evaluation protocols and 7 representative detectors, resulting in over 2,000 evaluations. Through these evaluations, we analyze from various perspectives, leading to 12 new insightful findings contributing to the field. We also open up 5 valuable yet previously underexplored research questions to inspire future works.
- [272] arXiv:2406.13498 [pdf, html, other]
-
Title: Semantic Enhanced Few-shot Object DetectionComments: Accepted by ICIP 2024Subjects: Computer Vision and Pattern Recognition (cs.CV)
Few-shot object detection~(FSOD), which aims to detect novel objects with limited annotated instances, has made significant progress in recent years. However, existing methods still suffer from biased representations, especially for novel classes in extremely low-shot scenarios. During fine-tuning, a novel class may exploit knowledge from similar base classes to construct its own feature distribution, leading to classification confusion and performance degradation. To address these challenges, we propose a fine-tuning based FSOD framework that utilizes semantic embeddings for better detection. In our proposed method, we align the visual features with class name embeddings and replace the linear classifier with our semantic similarity classifier. Our method trains each region proposal to converge to the corresponding class embedding. Furthermore, we introduce a multimodal feature fusion to augment the vision-language communication, enabling a novel class to draw support explicitly from well-trained similar base classes. To prevent class confusion, we propose a semantic-aware max-margin loss, which adaptively applies a margin beyond similar classes. As a result, our method allows each novel class to construct a compact feature space without being confused with similar base classes. Extensive experiments on Pascal VOC and MS COCO demonstrate the superiority of our method.
- [273] arXiv:2406.13499 [pdf, html, other]
-
Title: GraphMU: Repairing Robustness of Graph Neural Networks via Machine UnlearningSubjects: Social and Information Networks (cs.SI); Machine Learning (cs.LG)
Graph Neural Networks (GNNs) have demonstrated significant application potential in various fields. However, GNNs are still vulnerable to adversarial attacks. Numerous adversarial defense methods on GNNs are proposed to address the problem of adversarial attacks. However, these methods can only serve as a defense before poisoning, but cannot repair poisoned GNN. Therefore, there is an urgent need for a method to repair poisoned GNN. In this paper, we address this gap by introducing the novel concept of model repair for GNNs. We propose a repair framework, Repairing Robustness of Graph Neural Networks via Machine Unlearning (GraphMU), which aims to fine-tune poisoned GNN to forget adversarial samples without the need for complete retraining. We also introduce a unlearning validation method to ensure that our approach effectively forget specified poisoned data. To evaluate the effectiveness of GraphMU, we explore three fine-tuned subgraph construction scenarios based on the available perturbation information: (i) Known Perturbation Ratios, (ii) Known Complete Knowledge of Perturbations, and (iii) Unknown any Knowledge of Perturbations. Our extensive experiments, conducted across four citation datasets and four adversarial attack scenarios, demonstrate that GraphMU can effectively restore the performance of poisoned GNN.
- [274] arXiv:2406.13502 [pdf, html, other]
-
Title: ManWav: The First Manchu ASR ModelComments: ACL2024/Field MattersSubjects: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
This study addresses the widening gap in Automatic Speech Recognition (ASR) research between high resource and extremely low resource languages, with a particular focus on Manchu, a critically endangered language. Manchu exemplifies the challenges faced by marginalized linguistic communities in accessing state-of-the-art technologies. In a pioneering effort, we introduce the first-ever Manchu ASR model ManWav, leveraging Wav2Vec2-XLSR-53. The results of the first Manchu ASR is promising, especially when trained with our augmented data. Wav2Vec2-XLSR-53 fine-tuned with augmented data demonstrates a 0.02 drop in CER and 0.13 drop in WER compared to the same base model fine-tuned with original data.
- [275] arXiv:2406.13505 [pdf, other]
-
Title: Demonstration of low power and highly uniform 6-bit operation in SiO2-based memristors embedded with Pt nanoparticlesG. Kleitsiotis, P. Bousoulas, S. D. Mantas, C. Tsioustas, I. A. Fyrigos, G. Sirakoulis, D. TsoukalasSubjects: Hardware Architecture (cs.AR); Emerging Technologies (cs.ET); Applied Physics (physics.app-ph)
In this work, an optimized method was implemented for attaining stable multibit operation with low energy consumption in a two-terminal memory element made from the following layers: Ag/Pt nanoparticles (NPs)/SiO2/TiN in a 1-Transistor-1-Memristor configuration. Compared to the reference sample where no NPs were embedded, an enlarged memory window was recorded in conjunction with reduced variability for both switching states. A comprehensive numerical model was also applied to shed light on this enhanced performance, which was attributed to the spatial confinement effect induced by the presence of the Pt NPs and its impact on the properties of the percolating conducting filaments (CFs). Although 5-bit precision was demonstrated with the application of the incremental-step-pulse-programming (ISPP) algorithm, the reset process was unreliable and the output current increased abnormally when exceeded the value of 150 uA. As a result, the multibit operation was limited. To address this issue, a modified scheme was developed to accurately control the distance between the various resistance levels and achieve highly reliable 6-bit precision. Our work provides valuable insights for the development of energy-efficient memories for applications where a high density of conductance levels is required.
- [276] arXiv:2406.13507 [pdf, html, other]
-
Title: Scalable unsupervised alignment of general metric and non-metric structuresSubjects: Machine Learning (cs.LG)
Aligning data from different domains is a fundamental problem in machine learning with broad applications across very different areas, most notably aligning experimental readouts in single-cell multiomics. Mathematically, this problem can be formulated as the minimization of disagreement of pair-wise quantities such as distances and is related to the Gromov-Hausdorff and Gromov-Wasserstein distances. Computationally, it is a quadratic assignment problem (QAP) that is known to be NP-hard. Prior works attempted to solve the QAP directly with entropic or low-rank regularization on the permutation, which is computationally tractable only for modestly-sized inputs, and encode only limited inductive bias related to the domains being aligned. We consider the alignment of metric structures formulated as a discrete Gromov-Wasserstein problem and instead of solving the QAP directly, we propose to learn a related well-scalable linear assignment problem (LAP) whose solution is also a minimizer of the QAP. We also show a flexible extension of the proposed framework to general non-metric dissimilarities through differentiable ranks. We extensively evaluate our approach on synthetic and real datasets from single-cell multiomics and neural latent spaces, achieving state-of-the-art performance while being conceptually and computationally simple.
- [277] arXiv:2406.13511 [pdf, html, other]
-
Title: Slice-Level Scheduling for High Throughput and Load Balanced LLM ServingComments: 13 pages, 22 figuresSubjects: Distributed, Parallel, and Cluster Computing (cs.DC)
Large language models (LLMs) iteratively generate text token by token, with memory usage increasing with the length of generated token sequences. The unpredictability of generation lengths makes it difficult to estimate the time and memory needed to process requests, posing a challenge for effective request scheduling. Conventional sequence-level scheduling (SLS) serves requests in a first-come first-served (FCFS) manner with static batching where requests with short generation lengths are delayed until those with long ones have finished generation, which hurts computational efficiency. Besides, to avoid out-of-memory (OOM) errors, SLS batches requests with a small batch size, which limits throughput. Recently proposed iteration-level scheduling (ILS) enhances computational efficiency with continuous batching to return completed requests timely and dynamically add new requests for processing. However, many ILS schedulers limit the number of parallel-processing requests to avoid OOM errors while achieving a fast inference speed, which compromises throughput. Moreover, existing SLS and ILS schedulers fail to balance the workload across multiple deployed LLM instances. To tackle these challenges, we propose slice-level scheduling (SCLS). By splitting the predefined maximal generation length limit into slices and serving batches slice by slice, it provides a precise range of serving time and memory usage for batched requests, laying the foundation for effective scheduling. Experiments confirm that compared with SLS and ILS schedulers, SCLS can improve throughput by up to 315.8% and greatly mitigate load imbalance with proposed batching and offloading algorithms.
- [278] arXiv:2406.13514 [pdf, html, other]
-
Title: Locally orderless networksComments: 12 pages, 6 figuresSubjects: Computer Vision and Pattern Recognition (cs.CV)
We present Locally Orderless Networks (LON) and its theoretic foundation which links it to Convolutional Neural Networks (CNN), to Scale-space histograms, and measurement theory. The key elements are a regular sampling of the bias and the derivative of the activation function. We compare LON, CNN, and Scale-space histograms on prototypical single-layer networks. We show how LON and CNN can emulate each other, how LON expands the set of functionals computable to non-linear functions such as squaring. We demonstrate simple networks which illustrate the improved performance of LON over CNN on simple tasks for estimating the gradient magnitude squared, for regressing shape area and perimeter lengths, and for explainability of individual pixels' influence on the result.
- [279] arXiv:2406.13515 [pdf, html, other]
-
Title: MVSBoost: An Efficient Point Cloud-based 3D ReconstructionComments: The work is under reviewSubjects: Computer Vision and Pattern Recognition (cs.CV)
Efficient and accurate 3D reconstruction is crucial for various applications, including augmented and virtual reality, medical imaging, and cinematic special effects. While traditional Multi-View Stereo (MVS) systems have been fundamental in these applications, using neural implicit fields in implicit 3D scene modeling has introduced new possibilities for handling complex topologies and continuous surfaces. However, neural implicit fields often suffer from computational inefficiencies, overfitting, and heavy reliance on data quality, limiting their practical use. This paper presents an enhanced MVS framework that integrates multi-view 360-degree imagery with robust camera pose estimation via Structure from Motion (SfM) and advanced image processing for point cloud densification, mesh reconstruction, and texturing. Our approach significantly improves upon traditional MVS methods, offering superior accuracy and precision as validated using Chamfer distance metrics on the Realistic Synthetic 360 dataset. The developed MVS technique enhances the detail and clarity of 3D reconstructions and demonstrates superior computational efficiency and robustness in complex scene reconstruction, effectively handling occlusions and varying viewpoints. These improvements suggest that our MVS framework can compete with and potentially exceed current state-of-the-art neural implicit field methods, especially in scenarios requiring real-time processing and scalability.
- [280] arXiv:2406.13522 [pdf, html, other]
-
Title: Measured-state conditioned recursive feasibility for stochastic model predictive controlSubjects: Systems and Control (eess.SY); Optimization and Control (math.OC)
In this paper, we address the problem of designing stochastic model predictive control (MPC) schemes for linear systems affected by unbounded disturbances. The contribution of the paper is twofold. First, motivated by the difficulty of guaranteeing recursive feasibility in this framework, due to the nonzero probability of violating chance-constraints in the case of unbounded noise, we introduce the novel definition of measured-state conditioned recursive feasibility in expectation. Second, we construct a stochastic MPC scheme, based on the introduction of ellipsoidal probabilistic reachable sets, which implements a closed-loop initialization strategy, i.e., the current measured-state is employed for initializing the optimization problem. This new scheme is proven to satisfy the novel definition of recursive feasibility, and its superiority with respect to open-loop initialization schemes, arising from the fact that one never neglects the information brought by the current measurement, is shown through numerical examples.
- [281] arXiv:2406.13525 [pdf, html, other]
-
Title: On energy-dissipative finite element approximations for rate-type viscoelastic fluids with stress diffusionSubjects: Numerical Analysis (math.NA)
We study a fully discrete finite element approximation of a model for unsteady flows of rate-type viscoelastic fluids with stress diffusion in two and three dimensions. The model consists of the incompressible Navier--Stokes equation for the velocity, coupled with a diffusive variant of a combination of the Oldroyd-B and the Giesekus model for the left Cauchy--Green tensor. The discretization of the model is chosen such that an energy inequality is preserved at the fully discrete level. Thus, unconditional solvability and stability for the discrete system are guaranteed and the discrete Cauchy--Green tensor is positive definite. Moreover, subsequences of discrete solutions converge to a global-in-time weak solution, as the discretization parameters tend to zero. In the end, we present numerical convergence tests.
- [282] arXiv:2406.13527 [pdf, html, other]
-
Title: 4K4DGen: Panoramic 4D Generation at 4K ResolutionRenjie Li, Panwang Pan, Bangbang Yang, Dejia Xu, Shijie Zhou, Xuanyang Zhang, Zeming Li, Achuta Kadambi, Zhangyang Wang, Zhiwen FanSubjects: Computer Vision and Pattern Recognition (cs.CV)
The blooming of virtual reality and augmented reality (VR/AR) technologies has driven an increasing demand for the creation of high-quality, immersive, and dynamic environments. However, existing generative techniques either focus solely on dynamic objects or perform outpainting from a single perspective image, failing to meet the needs of VR/AR applications. In this work, we tackle the challenging task of elevating a single panorama to an immersive 4D experience. For the first time, we demonstrate the capability to generate omnidirectional dynamic scenes with 360-degree views at 4K resolution, thereby providing an immersive user experience. Our method introduces a pipeline that facilitates natural scene animations and optimizes a set of 4D Gaussians using efficient splatting techniques for real-time exploration. To overcome the lack of scene-scale annotated 4D data and models, especially in panoramic formats, we propose a novel Panoramic Denoiser that adapts generic 2D diffusion priors to animate consistently in 360-degree images, transforming them into panoramic videos with dynamic scenes at targeted regions. Subsequently, we elevate the panoramic video into a 4D immersive environment while preserving spatial and temporal consistency. By transferring prior knowledge from 2D models in the perspective domain to the panoramic domain and the 4D lifting with spatial appearance and geometry regularization, we achieve high-quality Panorama-to-4D generation at a resolution of (4096 $\times$ 2048) for the first time. See the project website at this https URL.
- [283] arXiv:2406.13529 [pdf, other]
-
Title: GMSNP and Finite StructuresSubjects: Discrete Mathematics (cs.DM); Combinatorics (math.CO); Logic (math.LO)
Given an (infinite) relational structure $\mathbb S$, we say that a finite structure $\mathbb C$ is a minimal finite factor of $\mathbb S$ if for every finite structure $\mathbb A$ there is a homomorphism $\mathbb S\to \mathbb A$ if and only if there is a homomorphism $\mathbb{C} \to \mathbb{A}$. In this brief note we prove that if CSP($\mathbb S$) is in GMSNP, then $\mathbb S$ has a minimal finite factor $\mathbb C$, and moreover, CSP($\mathbb C$) reduces in polynomial time to CSP($\mathbb S$). We discuss two nice applications of this result. First, we see that if a finite promise constraint satisfaction problem PCSP($\mathbb A,\mathbb B$) has a tractable GMSNP sandwich, then it has a tractable finite sandwich. We also show that if $\mathbb G$ is a non-bipartite (possibly infinite) graph with finite chromatic number, and CSP($\mathbb G$) is in GMSNP, then CSP($\mathbb G$) in NP-complete, partially answering a question recently asked by Bodirsky and Guzmán-Pro.
- [284] arXiv:2406.13532 [pdf, html, other]
-
Title: SALI: Short-term Alignment and Long-term Interaction Network for Colonoscopy Video Polyp SegmentationComments: Accepted to MICCAI 2024. Code and models: this https URLSubjects: Computer Vision and Pattern Recognition (cs.CV)
Colonoscopy videos provide richer information in polyp segmentation for rectal cancer diagnosis. However, the endoscope's fast moving and close-up observing make the current methods suffer from large spatial incoherence and continuous low-quality frames, and thus yield limited segmentation accuracy. In this context, we focus on robust video polyp segmentation by enhancing the adjacent feature consistency and rebuilding the reliable polyp representation. To achieve this goal, we in this paper propose SALI network, a hybrid of Short-term Alignment Module (SAM) and Long-term Interaction Module (LIM). The SAM learns spatial-aligned features of adjacent frames via deformable convolution and further harmonizes them to capture more stable short-term polyp representation. In case of low-quality frames, the LIM stores the historical polyp representations as a long-term memory bank, and explores the retrospective relations to interactively rebuild more reliable polyp features for the current segmentation. Combing SAM and LIM, the SALI network of video segmentation shows a great robustness to the spatial variations and low-visual cues. Benchmark on the large-scale SUNSEG verifies the superiority of SALI over the current state-of-the-arts by improving Dice by 2.1%, 2.5%, 4.1% and 1.9%, for the four test sub-sets, respectively. Codes are at this https URL.
- [285] arXiv:2406.13533 [pdf, other]
-
Title: DRACO: Decentralized Asynchronous Federated Learning over Continuous Row-Stochastic Network MatricesComments: This paper has been submitted to a peer-reviewed journal and is currently under reviewSubjects: Machine Learning (cs.LG); Information Theory (cs.IT); Networking and Internet Architecture (cs.NI)
Recent developments and emerging use cases, such as smart Internet of Things (IoT) and Edge AI, have sparked considerable interest in the training of neural networks over fully decentralized (serverless) networks. One of the major challenges of decentralized learning is to ensure stable convergence without resorting to strong assumptions applied for each agent regarding data distributions or updating policies. To address these issues, we propose DRACO, a novel method for decentralized asynchronous Stochastic Gradient Descent (SGD) over row-stochastic gossip wireless networks by leveraging continuous communication. Our approach enables edge devices within decentralized networks to perform local training and model exchanging along a continuous timeline, thereby eliminating the necessity for synchronized timing. The algorithm also features a specific technique of decoupling communication and computation schedules, which empowers complete autonomy for all users and manageable instructions for stragglers. Through a comprehensive convergence analysis, we highlight the advantages of asynchronous and autonomous participation in decentralized optimization. Our numerical experiments corroborate the efficacy of the proposed technique.
- [286] arXiv:2406.13536 [pdf, html, other]
-
Title: Image Distillation for Safe Data Sharing in HistopathologyComments: accepted at MICCAI 2024Subjects: Computer Vision and Pattern Recognition (cs.CV)
Histopathology can help clinicians make accurate diagnoses, determine disease prognosis, and plan appropriate treatment strategies. As deep learning techniques prove successful in the medical domain, the primary challenges become limited data availability and concerns about data sharing and privacy. Federated learning has addressed this challenge by training models locally and updating parameters on a server. However, issues, such as domain shift and bias, persist and impact overall performance. Dataset distillation presents an alternative approach to overcoming these challenges. It involves creating a small synthetic dataset that encapsulates essential information, which can be shared without constraints. At present, this paradigm is not practicable as current distillation approaches only generate non human readable representations and exhibit insufficient performance for downstream learning tasks. We train a latent diffusion model and construct a new distilled synthetic dataset with a small number of human readable synthetic images. Selection of maximally informative synthetic images is done via graph community analysis of the representation space. We compare downstream classification models trained on our synthetic distillation data to models trained on real data and reach performances suitable for practical application.
- [287] arXiv:2406.13542 [pdf, html, other]
-
Title: Self-play with Execution Feedback: Improving Instruction-following Capabilities of Large Language ModelsSubjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
One core capability of large language models (LLMs) is to follow natural language instructions. However, the issue of automatically constructing high-quality training data to enhance the complex instruction-following abilities of LLMs without manual annotation remains unresolved. In this paper, we introduce AutoIF, the first scalable and reliable method for automatically generating instruction-following training data. AutoIF transforms the validation of instruction-following data quality into code verification, requiring LLMs to generate instructions, the corresponding code to check the correctness of the instruction responses, and unit test samples to verify the code's correctness. Then, execution feedback-based rejection sampling can generate data for Supervised Fine-Tuning (SFT) and Reinforcement Learning from Human Feedback (RLHF) training. AutoIF achieves significant improvements across three training algorithms, SFT, Offline DPO, and Online DPO, when applied to the top open-source LLMs, Qwen2 and LLaMA3, in self-alignment and strong-to-weak distillation settings. Our code is publicly available at this https URL.
- [288] arXiv:2406.13543 [pdf, html, other]
-
Title: Towards Cyber Threat Intelligence for the IoTJournal-ref: 2023 19th International Conference on Distributed Computing in Smart Systems and the Internet of Things (DCOSS-IoT). IEEE, 2023Subjects: Cryptography and Security (cs.CR)
With the proliferation of digitization and its usage in critical sectors, it is necessary to include information about the occurrence and assessment of cyber threats in an organization's threat mitigation strategy. This Cyber Threat Intelligence (CTI) is becoming increasingly important, or rather necessary, for critical national and industrial infrastructures. Current CTI solutions are rather federated and unsuitable for sharing threat information from low-power IoT devices. This paper presents a taxonomy and analysis of the CTI frameworks and CTI exchange platforms available today. It proposes a new CTI architecture relying on the MISP Threat Intelligence Sharing Platform customized and focusing on IoT environment. The paper also introduces a tailored version of STIX (which we call tinySTIX), one of the most prominent standards adopted for CTI data modeling, optimized for low-power IoT devices using the new lightweight encoding and cryptography solutions. The proposed CTI architecture will be very beneficial for securing IoT networks, especially the ones working in harsh and adversarial environments.
- [289] arXiv:2406.13544 [pdf, html, other]
-
Title: One Fits All: Learning Fair Graph Neural Networks for Various Sensitive AttributesComments: Accepted by KDD 2024Subjects: Machine Learning (cs.LG)
Recent studies have highlighted fairness issues in Graph Neural Networks (GNNs), where they produce discriminatory predictions against specific protected groups categorized by sensitive attributes such as race and age. While various efforts to enhance GNN fairness have made significant progress, these approaches are often tailored to specific sensitive attributes. Consequently, they necessitate retraining the model from scratch to accommodate changes in the sensitive attribute requirement, resulting in high computational costs. To gain deeper insights into this issue, we approach the graph fairness problem from a causal modeling perspective, where we identify the confounding effect induced by the sensitive attribute as the underlying reason. Motivated by this observation, we formulate the fairness problem in graphs from an invariant learning perspective, which aims to learn invariant representations across environments. Accordingly, we propose a graph fairness framework based on invariant learning, namely FairINV, which enables the training of fair GNNs to accommodate various sensitive attributes within a single training session. Specifically, FairINV incorporates sensitive attribute partition and trains fair GNNs by eliminating spurious correlations between the label and various sensitive attributes. Experimental results on several real-world datasets demonstrate that FairINV significantly outperforms state-of-the-art fairness approaches, underscoring its effectiveness. Our code is available via: this https URL.
- [290] arXiv:2406.13547 [pdf, html, other]
-
Title: ModSec-Learn: Boosting ModSecurity with Machine LearningChristian Scano, Giuseppe Floris, Biagio Montaruli, Luca Demetrio, Andrea Valenza, Luca Compagna, Davide Ariu, Luca Piras, Davide Balzarotti, Battista BiggioComments: arXiv admin note: text overlap with arXiv:2308.04964Subjects: Machine Learning (cs.LG)
ModSecurity is widely recognized as the standard open-source Web Application Firewall (WAF), maintained by the OWASP Foundation. It detects malicious requests by matching them against the Core Rule Set (CRS), identifying well-known attack patterns. Each rule is manually assigned a weight based on the severity of the corresponding attack, and a request is blocked if the sum of the weights of matched rules exceeds a given threshold. However, we argue that this strategy is largely ineffective against web attacks, as detection is only based on heuristics and not customized on the application to protect. In this work, we overcome this issue by proposing a machine-learning model that uses the CRS rules as input features. Through training, ModSec-Learn is able to tune the contribution of each CRS rule to predictions, thus adapting the severity level to the web applications to protect. Our experiments show that ModSec-Learn achieves a significantly better trade-off between detection and false positive rates. Finally, we analyze how sparse regularization can reduce the number of rules that are relevant at inference time, by discarding more than 30% of the CRS rules. We release our open-source code and the dataset at this https URL and this https URL, respectively.
- [291] arXiv:2406.13551 [pdf, html, other]
-
Title: Mitigating Social Biases in Language Models through UnlearningOmkar Dige, Diljot Singh, Tsz Fung Yau, Qixuan Zhang, Borna Bolandraftar, Xiaodan Zhu, Faiza Khan KhattakSubjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Mitigating bias in language models (LMs) has become a critical problem due to the widespread deployment of LMs. Numerous approaches revolve around data pre-processing and fine-tuning of language models, tasks that can be both time-consuming and computationally demanding. Consequently, there is a growing interest in machine unlearning techniques given their capacity to induce the forgetting of undesired behaviors of the existing pre-trained or fine-tuned models with lower computational cost. In this work, we explore two unlearning methods, (1) Partitioned Contrastive Gradient Unlearning (PCGU) applied on decoder models and (2) Negation via Task Vector, to reduce social biases in state-of-the-art and open-source LMs such as LLaMA-2 and OPT. We also implement distributed PCGU for large models. It is empirically shown, through quantitative and qualitative analyses, that negation via Task Vector method outperforms PCGU in debiasing with minimum deterioration in performance and perplexity of the models. On LLaMA-27B, negation via Task Vector reduces the bias score by 11.8%
- [292] arXiv:2406.13552 [pdf, html, other]
-
Title: Standardness Fogs Meaning: A Position Regarding the Informed Usage of Standard DatasetsSubjects: Machine Learning (cs.LG); Human-Computer Interaction (cs.HC)
Standard datasets are frequently used to train and evaluate Machine Learning models. However, the assumed standardness of these datasets leads to a lack of in-depth discussion on how their labels match the derived categories for the respective use case. In other words, the standardness of the datasets seems to fog coherency and applicability, thus impeding the trust in Machine Learning models. We propose to adopt Grounded Theory and Hypotheses Testing through Visualization as methods to evaluate the match between use case, derived categories, and labels of standard datasets. To showcase the approach, we apply it to the 20 Newsgroups dataset and the MNIST dataset. For the 20 Newsgroups dataset, we demonstrate that the labels are imprecise. Therefore, we argue that neither a Machine Learning model can learn a meaningful abstraction of derived categories nor one can draw conclusions from achieving high accuracy. For the MNIST dataset, we demonstrate how the labels can be confirmed to be defined well. We conclude that a concept of standardness of a dataset implies that there is a match between use case, derived categories, and class labels, as in the case of the MNIST dataset. We argue that this is necessary to learn a meaningful abstraction and, thus, improve trust in the Machine Learning model.
- [293] arXiv:2406.13553 [pdf, html, other]
-
Title: Mining United Nations General Assembly DebatesMateusz Grzyb, Mateusz Krzyziński, Bartłomiej Sobieski, Mikołaj Spytek, Bartosz Pieliński, Daniel Dan, Anna WróblewskaComments: 4 pages, 1 figure, 2 tablesSubjects: Computation and Language (cs.CL); Information Retrieval (cs.IR)
This project explores the application of Natural Language Processing (NLP) techniques to analyse United Nations General Assembly (UNGA) speeches. Using NLP allows for the efficient processing and analysis of large volumes of textual data, enabling the extraction of semantic patterns, sentiment analysis, and topic modelling. Our goal is to deliver a comprehensive dataset and a tool (interface with descriptive statistics and automatically extracted topics) from which political scientists can derive insights into international relations and have the opportunity to have a nuanced understanding of global diplomatic discourse.
- [294] arXiv:2406.13555 [pdf, html, other]
-
Title: BiLD: Bi-directional Logits Difference Loss for Large Language Model DistillationComments: Submitted to ARR June (for EMNLP 2024)Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
In recent years, large language models (LLMs) have shown exceptional capabilities across various natural language processing (NLP) tasks. However, such impressive performance often comes with the trade-off of an increased parameter size, posing significant challenges for widespread deployment. Knowledge distillation (KD) provides a solution by transferring knowledge from a large teacher model to a smaller student model. In this paper, we explore the task-specific distillation of LLMs at the logit level. Our investigation reveals that the logits of fine-tuned LLMs exhibit a more extreme long-tail distribution than those from vision models, with hidden "noise" in the long tail affecting distillation performance. Furthermore, existing logits distillation methods often struggle to effectively utilize the internal ranking information from the logits. To address these, we propose the Bi-directional Logits Difference (BiLD) loss. The BiLD loss filters out the long-tail noise by utilizing only top-$k$ teacher and student logits, and leverages the internal logits ranking information by constructing logits differences. To evaluate BiLD loss, we conduct comprehensive experiments on 13 datasets using two types of LLMs. Our results show that the BiLD loss, with only the top-8 logits, outperforms supervised fine-tuning (SFT), vanilla KL loss, and five other distillation methods from both NLP and CV fields.
- [295] arXiv:2406.13556 [pdf, html, other]
-
Title: Evaluating Short-Term Temporal Fluctuations of Social Biases in Social Media Data and Masked Language ModelsSubjects: Computation and Language (cs.CL)
Social biases such as gender or racial biases have been reported in language models (LMs), including Masked Language Models (MLMs). Given that MLMs are continuously trained with increasing amounts of additional data collected over time, an important yet unanswered question is how the social biases encoded with MLMs vary over time. In particular, the number of social media users continues to grow at an exponential rate, and it is a valid concern for the MLMs trained specifically on social media data whether their social biases (if any) would also amplify over time. To empirically analyse this problem, we use a series of MLMs pretrained on chronologically ordered temporal snapshots of corpora. Our analysis reveals that, although social biases are present in all MLMs, most types of social bias remain relatively stable over time (with a few exceptions). To further understand the mechanisms that influence social biases in MLMs, we analyse the temporal corpora used to train the MLMs. Our findings show that some demographic groups, such as male, obtain higher preference over the other, such as female on the training corpora constantly.
- [296] arXiv:2406.13557 [pdf, other]
-
Title: satsuma: Structure-based Symmetry Breaking in SATComments: accepted to SAT 2024Subjects: Data Structures and Algorithms (cs.DS)
Symmetry reduction is crucial for solving many interesting SAT instances in practice. Numerous approaches have been proposed, which try to strike a balance between symmetry reduction and computational overhead. Arguably the most readily applicable method is the computation of static symmetry breaking constraints: a constraint restricting the search-space to non-symmetrical solutions is added to a given SAT instance. A distinct advantage of static symmetry breaking is that the SAT solver itself is not modified. A disadvantage is that the strength of symmetry reduction is usually limited. In order to boost symmetry reduction, the state-of-the-art tool BreakID [Devriendt et. al] pioneered the identification and tailored breaking of a particular substructure of symmetries, the so-called row interchangeability groups.
In this paper, we propose a new symmetry breaking tool called satsuma. The core principle of our tool is to exploit more diverse but frequently occurring symmetry structures. This is enabled by new practical detection algorithms for row interchangeability, row-column symmetry, Johnson symmetry, and various combinations. Based on the resulting structural description, we then produce symmetry breaking constraints. We compare this new approach to BreakID on a range of instance families exhibiting symmetry. Our benchmarks suggest improved symmetry reduction in the presence of Johnson symmetry and comparable performance in the presence of row-column symmetry. Moreover, our implementation runs significantly faster, even though it identifies more diverse structures. - [297] arXiv:2406.13558 [pdf, html, other]
-
Title: Enhancing Travel Choice Modeling with Large Language Models: A Prompt-Learning ApproachSubjects: Artificial Intelligence (cs.AI)
Travel choice analysis is crucial for understanding individual travel behavior to develop appropriate transport policies and recommendation systems in Intelligent Transportation Systems (ITS). Despite extensive research, this domain faces two critical challenges: a) modeling with limited survey data, and b) simultaneously achieving high model explainability and accuracy. In this paper, we introduce a novel prompt-learning-based Large Language Model(LLM) framework that significantly improves prediction accuracy and provides explicit explanations for individual predictions. This framework involves three main steps: transforming input variables into textual form; building of demonstrations similar to the object, and applying these to a well-trained LLM. We tested the framework's efficacy using two widely used choice datasets: London Passenger Mode Choice (LPMC) and Optima-Mode collected in Switzerland. The results indicate that the LLM significantly outperforms state-of-the-art deep learning methods and discrete choice models in predicting people's choices. Additionally, we present a case of explanation illustrating how the LLM framework generates understandable and explicit explanations at the individual level.
- [298] arXiv:2406.13559 [pdf, html, other]
-
Title: Solarcast-ML: Per Node GraphCast Extension for Solar Energy ProductionComments: 5 pages, 5 figures, project website: this http URLSubjects: Machine Learning (cs.LG)
This project presents an extension to the GraphCast model, a state-of-the-art graph neural network (GNN) for global weather forecasting, by integrating solar energy production forecasting capabilities. The proposed approach leverages the weather forecasts generated by GraphCast and trains a neural network model to predict the ratio of actual solar output to potential solar output based on various weather conditions. The model architecture consists of an input layer corresponding to weather features (temperature, humidity, dew point, wind speed, rain, barometric pressure, and altitude), two hidden layers with ReLU activations, and an output layer predicting solar radiation. The model is trained using a mean absolute error loss function and Adam optimizer. The results demonstrate the model's effectiveness in accurately predicting solar radiation, with its convergence behavior, decreasing training loss, and accurate prediction of solar radiation patterns suggesting successful learning of the underlying relationships between weather conditions and solar radiation. The integration of solar energy production forecasting with GraphCast offers valuable insights for the renewable energy sector, enabling better planning and decision-making based on expected solar energy production. Future work could explore further model refinements, incorporation of additional weather variables, and extension to other renewable energy sources.
- [299] arXiv:2406.13560 [pdf, html, other]
-
Title: Lexically Grounded Subword SegmentationComments: 8 pages (+ 8 pages appendix), 2 figuresSubjects: Computation and Language (cs.CL)
We present three innovations in tokenization and subword segmentation. First, we propose to use unsupervised morphological analysis with Morfessor as pre-tokenization. Second, we present an algebraic method for obtaining subword embeddings grounded in a word embedding space. Based on that, we design a novel subword segmentation algorithm that uses the embeddings, ensuring that the procedure considers lexical meaning. Third, we introduce an efficient segmentation algorithm based on a subword bigram model that can be initialized with the lexically aware segmentation method to avoid using Morfessor and large embedding tables at inference time. We evaluate the proposed approaches using two intrinsic metrics and measure their performance on two downstream tasks: part-of-speech tagging and machine translation. Our experiments show significant improvements in the morphological plausibility of the segmentation when evaluated using segmentation precision on morpheme boundaries and improved Rényi efficiency in 8 languages. Although the proposed tokenization methods do not have a large impact on automatic translation quality, we observe consistent performance gains in the arguably more morphological task of part-of-speech tagging.
- [300] arXiv:2406.13564 [pdf, html, other]
-
Title: Is AI fun? HumorDB: a curated dataset and benchmark to investigate graphical humorComments: 7 main figures, 5 additional appendix figuresSubjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Despite significant advancements in computer vision, understanding complex scenes, particularly those involving humor, remains a substantial challenge. This paper introduces HumorDB, a novel image-only dataset specifically designed to advance visual humor understanding. HumorDB consists of meticulously curated image pairs with contrasting humor ratings, emphasizing subtle visual cues that trigger humor and mitigating potential biases. The dataset enables evaluation through binary classification(Funny or Not Funny), range regression(funniness on a scale from 1 to 10), and pairwise comparison tasks(Which Image is Funnier?), effectively capturing the subjective nature of humor perception. Initial experiments reveal that while vision-only models struggle, vision-language models, particularly those leveraging large language models, show promising results. HumorDB also shows potential as a valuable zero-shot benchmark for powerful large multimodal models. We open-source both the dataset and code under the CC BY 4.0 license.
- [301] arXiv:2406.13565 [pdf, html, other]
-
Title: Exploring Multi-view Pixel Contrast for General and Robust Image Forgery LocalizationSubjects: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR)
Image forgery localization, which aims to segment tampered regions in an image, is a fundamental yet challenging digital forensic task. While some deep learning-based forensic methods have achieved impressive results, they directly learn pixel-to-label mappings without fully exploiting the relationship between pixels in the feature space. To address such deficiency, we propose a Multi-view Pixel-wise Contrastive algorithm (MPC) for image forgery localization. Specifically, we first pre-train the backbone network with the supervised contrastive loss to model pixel relationships from the perspectives of within-image, cross-scale and cross-modality. That is aimed at increasing intra-class compactness and inter-class separability. Then the localization head is fine-tuned using the cross-entropy loss, resulting in a better pixel localizer. The MPC is trained on three different scale training datasets to make a comprehensive and fair comparison with existing image forgery localization algorithms. Extensive experiments on the small, medium and large scale training datasets show that the proposed MPC achieves higher generalization performance and robustness against post-processing than the state-of-the-arts. Code will be available at this https URL.
- [302] arXiv:2406.13566 [pdf, html, other]
-
Title: Parametric finite element approximation of two-phase Navier--Stokes flow with viscoelasticitySubjects: Numerical Analysis (math.NA)
In this work, we present a parametric finite element approximation of two-phase Navier-Stokes flow with viscoelasticity. The free boundary problem is given by the viscoelastic Navier-Stokes equations in the two fluid phases, connected by jump conditions across the interface. The elasticity in the fluids is characterised using the Oldroyd-B model with possible stress diffusion. The model was originally introduced to approximate fluid-structure interaction problems between an incompressible Newtonian fluid and a hyperelastic neo-Hookean solid, which are possible limit cases of the model. We approximate a variational formulation of the model with an unfitted finite element method that uses piecewise linear parametric finite elements. The two-phase Navier-Stokes-Oldroyd-B system in the bulk regions is discretised in a way that guarantees unconditional solvability and stability for the coupled bulk-interface system. Good volume conservation properties for the two phases are observed in the case where the pressure approximation space is enriched with the help of an XFEM function. We show the applicability of our method with some numerical results.
- [303] arXiv:2406.13567 [pdf, html, other]
-
Title: Galerkin Neural Network-POD for Acoustic and Electromagnetic Wave Propagation in Parametric DomainsSubjects: Numerical Analysis (math.NA)
We investigate reduced-order models for acoustic and electromagnetic wave problems in parametrically defined domains. The parameter-to-solution maps are approximated following the so-called Galerkin POD-NN method, which combines the construction of a reduced basis via proper orthogonal decomposition (POD) with neural networks (NNs). As opposed to the standard reduced basis method, this approach allows for the swift and efficient evaluation of reduced-order solutions for any given parametric input.
As is customary in the analysis of problems in random or parametrically defined domains, we start by transporting the formulation to a reference domain. This yields a parameter-dependent variational problem set on parameter-independent functional spaces. In particular, we consider affine-parametric domain transformations characterized by a high-dimensional, possibly countably infinite, parametric input. To keep the number of evaluations of the high-fidelity solutions manageable, we propose using low-discrepancy sequences to sample the parameter space efficiently. Then, we train an NN to learn the coefficients in the reduced representation. This approach completely decouples the offline and online stages of the reduced basis paradigm.
Numerical results for the three-dimensional Helmholtz and Maxwell equations confirm the method's accuracy up to a certain barrier and show significant gains in online speed-up compared to the traditional Galerkin POD method. - [304] arXiv:2406.13568 [pdf, html, other]
-
Title: Trapezoidal Gradient Descent for Effective Reinforcement Learning in Spiking NetworksSubjects: Artificial Intelligence (cs.AI)
With the rapid development of artificial intelligence technology, the field of reinforcement learning has continuously achieved breakthroughs in both theory and practice. However, traditional reinforcement learning algorithms often entail high energy consumption during interactions with the environment. Spiking Neural Network (SNN), with their low energy consumption characteristics and performance comparable to deep neural networks, have garnered widespread attention. To reduce the energy consumption of practical applications of reinforcement learning, researchers have successively proposed the Pop-SAN and MDC-SAN algorithms. Nonetheless, these algorithms use rectangular functions to approximate the spike network during the training process, resulting in low sensitivity, thus indicating room for improvement in the training effectiveness of SNN. Based on this, we propose a trapezoidal approximation gradient method to replace the spike network, which not only preserves the original stable learning state but also enhances the model's adaptability and response sensitivity under various signal dynamics. Simulation results show that the improved algorithm, using the trapezoidal approximation gradient to replace the spike network, achieves better convergence speed and performance compared to the original algorithm and demonstrates good training stability.
- [305] arXiv:2406.13569 [pdf, html, other]
-
Title: Bayes' capacity as a measure for reconstruction attacks in federated learningSayan Biswas, Mark Dras, Pedro Faustini, Natasha Fernandes, Annabelle McIver, Catuscia Palamidessi, Parastoo SadeghiSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Information Theory (cs.IT)
Within the machine learning community, reconstruction attacks are a principal attack of concern and have been identified even in federated learning, which was designed with privacy preservation in mind. In federated learning, it has been shown that an adversary with knowledge of the machine learning architecture is able to infer the exact value of a training element given an observation of the weight updates performed during stochastic gradient descent. In response to these threats, the privacy community recommends the use of differential privacy in the stochastic gradient descent algorithm, termed DP-SGD. However, DP has not yet been formally established as an effective countermeasure against reconstruction attacks. In this paper, we formalise the reconstruction threat model using the information-theoretic framework of quantitative information flow. We show that the Bayes' capacity, related to the Sibson mutual information of order infinity, represents a tight upper bound on the leakage of the DP-SGD algorithm to an adversary interested in performing a reconstruction attack. We provide empirical results demonstrating the effectiveness of this measure for comparing mechanisms against reconstruction threats.
- [306] arXiv:2406.13570 [pdf, other]
-
Title: System Immersion of a Driving Simulator Affects the Oscillatory Brain ActivitySubjects: Human-Computer Interaction (cs.HC)
The technological properties of a system delivering simulation experience are a crucial dimension of immersion. To create a sense of presence and reproduce drivers behaviour as realistically as possible, we need reliable driving simulators that allow drivers to become highly immersed. This study investigates the impact of a system immersion of a driving simulator on the drivers' brain activity while operating a conditionally automated vehicle. Nineteen participants drove approximately 40 minutes while their brain activity was recorded using electroencephalography (EEG). We found a significant effect of the system immersion in the occipital and parietal areas, primarily in the high-Beta bandwidth. No effect was found in the Theta, Alpha, and low-Beta bandwidths. These findings suggest that the system immersion might influence the drivers' physiological arousal, consequently influencing their cognitive and emotional processes.
- [307] arXiv:2406.13573 [pdf, html, other]
-
Title: Improved Bounds for Fully Dynamic Matching via Ordered Ruzsa-Szemer\'edi GraphsComments: 24 pages, 2 figuresSubjects: Data Structures and Algorithms (cs.DS)
In a very recent breakthrough, Behnezhad and Ghafari [arXiv'24] developed a novel fully dynamic randomized algorithm for maintaining a $(1-\epsilon)$-approximation of maximum matching with amortized update time potentially much better than the trivial $O(n)$ update time. The runtime of the BG algorithm is parameterized via the following graph theoretical concept:
* For any $n$, define $ORS(n)$ -- standing for Ordered RS Graph -- to be the largest number of edge-disjoint matchings $M_1,\ldots,M_t$ of size $\Theta(n)$ in an $n$-vertex graph such that for every $i \in [t]$, $M_i$ is an induced matching in the subgraph $M_{i} \cup M_{i+1} \cup \ldots \cup M_t$.
Then, for any fixed $\epsilon > 0$, the BG algorithm runs in \[
O\left( \sqrt{n^{1+O(\epsilon)} \cdot ORS(n)} \right) \] amortized update time with high probability, even against an adaptive adversary. $ORS(n)$ is a close variant of a more well-known quantity regarding RS graphs (which require every matching to be induced regardless of the ordering). It is currently only known that $n^{o(1)} \leqslant ORS(n) \leqslant n^{1-o(1)}$, and closing this gap appears to be a notoriously challenging problem.
In this work, we further strengthen the result of Behnezhad and Ghafari and push it to limit to obtain a randomized algorithm with amortized update time of \[
n^{o(1)} \cdot ORS(n) \] with high probability, even against an adaptive adversary. In the limit, i.e., if current lower bounds for $ORS(n) = n^{o(1)}$ are almost optimal, our algorithm achieves an $n^{o(1)}$ update time for $(1-\epsilon)$-approximation of maximum matching, almost fully resolving this fundamental question. In its current stage also, this fully reduces the algorithmic problem of designing dynamic matching algorithms to a purely combinatorial problem of upper bounding $ORS(n)$ with no algorithmic considerations. - [308] arXiv:2406.13576 [pdf, html, other]
-
Title: Trusted Video Inpainting Localization via Deep Attentive Noise LearningSubjects: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR)
Digital video inpainting techniques have been substantially improved with deep learning in recent years. Although inpainting is originally designed to repair damaged areas, it can also be used as malicious manipulation to remove important objects for creating false scenes and facts. As such it is significant to identify inpainted regions blindly. In this paper, we present a Trusted Video Inpainting Localization network (TruVIL) with excellent robustness and generalization ability. Observing that high-frequency noise can effectively unveil the inpainted regions, we design deep attentive noise learning in multiple stages to capture the inpainting traces. Firstly, a multi-scale noise extraction module based on 3D High Pass (HP3D) layers is used to create the noise modality from input RGB frames. Then the correlation between such two complementary modalities are explored by a cross-modality attentive fusion module to facilitate mutual feature learning. Lastly, spatial details are selectively enhanced by an attentive noise decoding module to boost the localization performance of the network. To prepare enough training samples, we also build a frame-level video object segmentation dataset of 2500 videos with pixel-level annotation for all frames. Extensive experimental results validate the superiority of TruVIL compared with the state-of-the-arts. In particular, both quantitative and qualitative evaluations on various inpainted videos verify the remarkable robustness and generalization ability of our proposed TruVIL. Code and dataset will be available at this https URL.
- [309] arXiv:2406.13578 [pdf, html, other]
-
Title: Enhancing Distractor Generation for Multiple-Choice Questions with Retrieval Augmented Pretraining and Knowledge Graph IntegrationHan-Cheng Yu, Yu-An Shih, Kin-Man Law, Kai-Yu Hsieh, Yu-Chen Cheng, Hsin-Chih Ho, Zih-An Lin, Wen-Chuan Hsu, Yao-Chung FanComments: Findings at ACL 2024Subjects: Computation and Language (cs.CL)
In this paper, we tackle the task of distractor generation (DG) for multiple-choice questions. Our study introduces two key designs. First, we propose \textit{retrieval augmented pretraining}, which involves refining the language model pretraining to align it more closely with the downstream task of DG. Second, we explore the integration of knowledge graphs to enhance the performance of DG. Through experiments with benchmarking datasets, we show that our models significantly outperform the state-of-the-art results. Our best-performing model advances the F1@3 score from 14.80 to 16.47 in MCQ dataset and from 15.92 to 16.50 in Sciq dataset.
- [310] arXiv:2406.13579 [pdf, other]
-
Title: Automated Bioacoustic Monitoring for South African Bird Species on Unlabeled DataMichael Doell, Dominik Kuehn, Vanessa Suessle, Matthew J. Burnett, Colleen T. Downs, Andreas Weinmann, Elke HergenroetherComments: preprintJournal-ref: International Conferences in Central Europe on Computer Graphics, Visualization and Computer Vision 2024Subjects: Sound (cs.SD); Computer Vision and Pattern Recognition (cs.CV); Audio and Speech Processing (eess.AS)
Analyses for biodiversity monitoring based on passive acoustic monitoring (PAM) recordings is time-consuming and challenged by the presence of background noise in recordings. Existing models for sound event detection (SED) worked only on certain avian species and the development of further models required labeled data. The developed framework automatically extracted labeled data from available platforms for selected avian species. The labeled data were embedded into recordings, including environmental sounds and noise, and were used to train convolutional recurrent neural network (CRNN) models. The models were evaluated on unprocessed real world data recorded in urban KwaZulu-Natal habitats. The Adapted SED-CRNN model reached a F1 score of 0.73, demonstrating its efficiency under noisy, real-world conditions. The proposed approach to automatically extract labeled data for chosen avian species enables an easy adaption of PAM to other species and habitats for future conservation projects.
- [311] arXiv:2406.13583 [pdf, html, other]
-
Title: Low-Rank Mixture-of-Experts for Continual Medical Image SegmentationSubjects: Computer Vision and Pattern Recognition (cs.CV)
The primary goal of continual learning (CL) task in medical image segmentation field is to solve the "catastrophic forgetting" problem, where the model totally forgets previously learned features when it is extended to new categories (class-level) or tasks (task-level). Due to the privacy protection, the historical data labels are inaccessible. Prevalent continual learning methods primarily focus on generating pseudo-labels for old datasets to force the model to memorize the learned features. However, the incorrect pseudo-labels may corrupt the learned feature and lead to a new problem that the better the model is trained on the old task, the poorer the model performs on the new tasks. To avoid this problem, we propose a network by introducing the data-specific Mixture of Experts (MoE) structure to handle the new tasks or categories, ensuring that the network parameters of previous tasks are unaffected or only minimally impacted. To further overcome the tremendous memory costs caused by introducing additional structures, we propose a Low-Rank strategy which significantly reduces memory cost. We validate our method on both class-level and task-level continual learning challenges. Extensive experiments on multiple datasets show our model outperforms all other methods.
- [312] arXiv:2406.13584 [pdf, html, other]
-
Title: Explaining time series models using frequency maskingComments: Submitted to the Next Generation of AI Safety workshop at ICML 2024Subjects: Machine Learning (cs.LG)
Time series data is fundamentally important for describing many critical domains such as healthcare, finance, and climate, where explainable models are necessary for safe automated decision-making. To develop eXplainable AI (XAI) in these domains therefore implies explaining salient information in the time series. Current methods for obtaining saliency maps assumes localized information in the raw input space. In this paper, we argue that the salient information of a number of time series is more likely to be localized in the frequency domain. We propose FreqRISE, which uses masking based methods to produce explanations in the frequency and time-frequency domain, which shows the best performance across a number of tasks.
- [313] arXiv:2406.13585 [pdf, html, other]
-
Title: MEV Ecosystem Evolution From Ethereum 1.0Subjects: Cryptography and Security (cs.CR)
Smart contracts led to the emergence of the decentralized finance (DeFi) marketplace within blockchain ecosystems, where diverse participants engage in financial activities. In traditional finance, there are possibilities to create values, e.g., arbitrage offers to create value from market inefficiencies or front-running offers to extract value for the participants having privileged roles. Such opportunities are readily available -- searching programmatically in DeFi. It is commonly known as Maximal Extractable Value (MEV) in the literature. In this survey, first, we show how lucrative such opportunities can be. Next, we discuss how protocol-following participants trying to capture such opportunities threaten to sabotage blockchain's performance and the core tenets of decentralization, transparency, and trustlessness that blockchains are based on. Then, we explain different attempts by the community in the past to address these issues and the problems introduced by these solutions. Finally, we review the current state of research trying to restore trustlessness and decentralization to provide all DeFi participants with a fair marketplace.
- [314] arXiv:2406.13586 [pdf, html, other]
-
Title: Submodular Participatory BudgetingSubjects: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI)
Participatory budgeting refers to the practice of allocating public resources by collecting and aggregating individual preferences. Most existing studies in this field often assume an additive utility function, where each individual holds a private utility for each candidate project, and the total utility of a set of funded projects is simply the sum of the utilities of all projects. We argue that this assumption does not always hold in reality. For example, building two playgrounds in the same neighborhood does not necessarily lead to twice the utility of building a single playground.
To address this, we extend the existing study by proposing a submodular participatory budgeting problem, assuming that the utility function of each individual is a monotone and submodular function over funded projects. We propose and examine three preference elicitation methods, including \emph{ranking-by-marginal-values}, \emph{ranking-by-values} and \emph{threshold approval votes}, and analyze their performances in terms of distortion. Notably, if the utility function is addicative, our aggregation rule designed for threshold approval votes achieves a better distortion than the state-of-the-art approach. - [315] arXiv:2406.13588 [pdf, other]
-
Title: CNN Based Flank Predictor for Quadruped Animal SpeciesJournal-ref: Workshop Camera Traps, AI and Ecology 2023Subjects: Computer Vision and Pattern Recognition (cs.CV)
The bilateral asymmetry of flanks of animals with visual body marks that uniquely identify an individual, complicates tasks like population estimations. Automatically generated additional information on the visible side of the animal would improve the accuracy for individual identification. In this study we used transfer learning on popular CNN image classification architectures to train a flank predictor that predicts the visible flank of quadruped mammalian species in images. We automatically derived the data labels from existing datasets originally labeled for animal pose estimation. We trained the models in two phases with different degrees of retraining. The developed models were evaluated in different scenarios of different unknown quadruped species in known and unknown environments. As a real-world scenario, we used a dataset of manually labeled Eurasian lynx (Lynx lynx) from camera traps in the Bavarian Forest National Park to evaluate the model. The best model, trained on an EfficientNetV2 backbone, achieved an accuracy of 88.70 % for the unknown species lynx in a complex habitat.
- [316] arXiv:2406.13597 [pdf, html, other]
-
Title: GraphKAN: Enhancing Feature Extraction with Graph Kolmogorov Arnold NetworksSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Massive number of applications involve data with underlying relationships embedded in non-Euclidean space. Graph neural networks (GNNs) are utilized to extract features by capturing the dependencies within graphs. Despite groundbreaking performances, we argue that Multi-layer perceptrons (MLPs) and fixed activation functions impede the feature extraction due to information loss. Inspired by Kolmogorov Arnold Networks (KANs), we make the first attempt to GNNs with KANs. We discard MLPs and activation functions, and instead used KANs for feature extraction. Experiments demonstrate the effectiveness of GraphKAN, emphasizing the potential of KANs as a powerful tool. Code is available at this https URL.
- [317] arXiv:2406.13599 [pdf, other]
-
Title: Defying the Odds: Solana's Unexpected Resilience in Spite of the Security Challenges Faced by DevelopersSébastien Andreina, Tobias Cloosters, Lucas Davi, Jens-Rene Giesen, Marco Gutfleisch, Ghassan Karame, Alena Naiakshina, Houda NajiComments: To appear in the Proceedings of the 31st ACM Conference on Computer and Communications Security (CCS), 2024Subjects: Cryptography and Security (cs.CR)
Solana gained considerable attention as one of the most popular blockchain platforms for deploying decentralized applications. Compared to Ethereum, however, we observe a lack of research on how Solana smart contract developers handle security, what challenges they encounter, and how this affects the overall security of the ecosystem. To address this, we conducted the first comprehensive study on the Solana platform consisting of a 90-minute Solana smart contract code review task with 35 participants followed by interviews with a subset of seven participants. Our study shows, quite alarmingly, that none of the participants could detect all important security vulnerabilities in a code review task and that 83% of the participants are likely to release vulnerable smart contracts. Our study also sheds light on the root causes of developers' challenges with Solana smart contract development, suggesting the need for better security guidance and resources. In spite of these challenges, our automated analysis on currently deployed Solana smart contracts surprisingly suggests that the prevalence of vulnerabilities - especially those pointed out as the most challenging in our developer study - is below 0.3%. We explore the causes of this counter-intuitive resilience and show that frameworks, such as Anchor, are aiding Solana developers in deploying secure contracts.
- [318] arXiv:2406.13600 [pdf, html, other]
-
Title: CoDreamer: Communication-Based Decentralised World ModelsSubjects: Artificial Intelligence (cs.AI)
Sample efficiency is a critical challenge in reinforcement learning. Model-based RL has emerged as a solution, but its application has largely been confined to single-agent scenarios. In this work, we introduce CoDreamer, an extension of the Dreamer algorithm for multi-agent environments. CoDreamer leverages Graph Neural Networks for a two-level communication system to tackle challenges such as partial observability and inter-agent cooperation. Communication is separately utilised within the learned world models and within the learned policies of each agent to enhance modelling and task-solving. We show that CoDreamer offers greater expressive power than a naive application of Dreamer, and we demonstrate its superiority over baseline methods across various multi-agent environments.
- [319] arXiv:2406.13602 [pdf, other]
-
Title: Parameter Training Efficiency Aware Resource Allocation for AIGC in Space-Air-Ground Integrated NetworksComments: submitted to a journalSubjects: Emerging Technologies (cs.ET); Signal Processing (eess.SP)
With the evolution of artificial intelligence-generated content (AIGC) techniques and the development of space-air-ground integrated networks (SAGIN), there will be a growing opportunity to enhance more users' mobile experience with customized AIGC applications. This is made possible through the use of parameter-efficient fine-tuning (PEFT) training alongside mobile edge computing. In this paper, we formulate the optimization problem of maximizing the parameter training efficiency of the SAGIN system over wireless networks under limited resource constraints. We propose the Parameter training efficiency Aware Resource Allocation (PARA) technique to jointly optimize user association, data offloading, and communication and computational resource allocation. Solid proofs are presented to solve this difficult sum of ratios problem based on quadratically constrained quadratic programming (QCQP), semidefinite programming (SDP), graph theory, and fractional programming (FP) techniques. Our proposed PARA technique is effective in finding a stationary point of this non-convex problem. The simulation results demonstrate that the proposed PARA method outperforms other baselines.
- [320] arXiv:2406.13604 [pdf, html, other]
-
Title: Root Cause Localization for Microservice Systems in Cloud-edge Collaborative EnvironmentsSubjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Performance (cs.PF)
With the development of cloud-native technologies, microservice-based software systems face challenges in accurately localizing root causes when failures occur. Additionally, the cloud-edge collaborative environment introduces more difficulties, such as unstable networks and high latency across network segments. Accurately identifying the root cause of microservices in a cloud-edge collaborative environment has thus become an urgent problem. In this paper, we propose MicroCERCL, a novel approach that pinpoints root causes at the kernel and application level in the cloud-edge collaborative environment. Our key insight is that failures propagate through direct invocations and indirect resource-competition dependencies in a cloud-edge collaborative environment characterized by instability and high latency. This will become more complex in the hybrid deployment that simultaneously involves multiple microservice systems. Leveraging this insight, we extract valid contents from kernel-level logs to prioritize localizing the kernel-level root cause. Moreover, we construct a heterogeneous dynamic topology stack and train a graph neural network model to accurately localize the application-level root cause without relying on historical data. Notably, we released the first benchmark hybrid deployment microservice system in a cloud-edge collaborative environment (the largest and most complex within our knowledge). Experiments conducted on the dataset collected from the benchmark show that MicroCERCL can accurately localize the root cause of microservice systems in such environments, significantly outperforming state-of-the-art approaches with an increase of at least 24.1% in top-1 accuracy.
- [321] arXiv:2406.13605 [pdf, html, other]
-
Title: Nicer Than Humans: How do Large Language Models Behave in the Prisoner's Dilemma?Comments: 9 pages, 8 figures, 1 tableSubjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT); Physics and Society (physics.soc-ph)
The behavior of Large Language Models (LLMs) as artificial social agents is largely unexplored, and we still lack extensive evidence of how these agents react to simple social stimuli. Testing the behavior of AI agents in classic Game Theory experiments provides a promising theoretical framework for evaluating the norms and values of these agents in archetypal social situations. In this work, we investigate the cooperative behavior of Llama2 when playing the Iterated Prisoner's Dilemma against random adversaries displaying various levels of hostility. We introduce a systematic methodology to evaluate an LLM's comprehension of the game's rules and its capability to parse historical gameplay logs for decision-making. We conducted simulations of games lasting for 100 rounds, and analyzed the LLM's decisions in terms of dimensions defined in behavioral economics literature. We find that Llama2 tends not to initiate defection but it adopts a cautious approach towards cooperation, sharply shifting towards a behavior that is both forgiving and non-retaliatory only when the opponent reduces its rate of defection below 30%. In comparison to prior research on human participants, Llama2 exhibits a greater inclination towards cooperative behavior. Our systematic approach to the study of LLMs in game theoretical scenarios is a step towards using these simulations to inform practices of LLM auditing and alignment.
- [322] arXiv:2406.13606 [pdf, html, other]
-
Title: DDLNet: Boosting Remote Sensing Change Detection with Dual-Domain LearningComments: ICME 2024 OralSubjects: Computer Vision and Pattern Recognition (cs.CV)
Remote sensing change detection (RSCD) aims to identify the changes of interest in a region by analyzing multi-temporal remote sensing images, and has an outstanding value for local development monitoring. Existing RSCD methods are devoted to contextual modeling in the spatial domain to enhance the changes of interest. Despite the satisfactory performance achieved, the lack of knowledge in the frequency domain limits the further improvement of model performance. In this paper, we propose DDLNet, a RSCD network based on dual-domain learning (i.e., frequency and spatial domains). In particular, we design a Frequency-domain Enhancement Module (FEM) to capture frequency components from the input bi-temporal images using Discrete Cosine Transform (DCT) and thus enhance the changes of interest. Besides, we devise a Spatial-domain Recovery Module (SRM) to fuse spatiotemporal features for reconstructing spatial details of change representations. Extensive experiments on three benchmark RSCD datasets demonstrate that the proposed method achieves state-of-the-art performance and reaches a more satisfactory accuracy-efficiency trade-off. Our code is publicly available at this https URL.
- [323] arXiv:2406.13607 [pdf, html, other]
-
Title: Ultra-High-Definition Restoration: New Benchmarks and A Dual Interaction Prior-Driven SolutionSubjects: Computer Vision and Pattern Recognition (cs.CV)
Ultra-High-Definition (UHD) image restoration has acquired remarkable attention due to its practical demand. In this paper, we construct UHD snow and rain benchmarks, named UHD-Snow and UHD-Rain, to remedy the deficiency in this field. The UHD-Snow/UHD-Rain is established by simulating the physics process of rain/snow into consideration and each benchmark contains 3200 degraded/clear image pairs of 4K resolution. Furthermore, we propose an effective UHD image restoration solution by considering gradient and normal priors in model design thanks to these priors' spatial and detail contributions. Specifically, our method contains two branches: (a) feature fusion and reconstruction branch in high-resolution space and (b) prior feature interaction branch in low-resolution space. The former learns high-resolution features and fuses prior-guided low-resolution features to reconstruct clear images, while the latter utilizes normal and gradient priors to mine useful spatial features and detail features to guide high-resolution recovery better. To better utilize these priors, we introduce single prior feature interaction and dual prior feature interaction, where the former respectively fuses normal and gradient priors with high-resolution features to enhance prior ones, while the latter calculates the similarity between enhanced prior ones and further exploits dual guided filtering to boost the feature interaction of dual priors. We conduct experiments on both new and existing public datasets and demonstrate the state-of-the-art performance of our method on UHD image low-light enhancement, UHD image desonwing, and UHD image deraining. The source codes and benchmarks are available at \url{this https URL}.
- [324] arXiv:2406.13608 [pdf, html, other]
-
Title: Wiretapped Commitment over Binary ChannelsComments: 13 Pages, 1 figureSubjects: Information Theory (cs.IT); Cryptography and Security (cs.CR)
We propose the problem of wiretapped commitment, where two parties, say committer Alice and receiver Bob, engage in a commitment protocol using a noisy channel as a resource, in the presence of an eavesdropper, say Eve. Noisy versions of Alice's transmission over the wiretap channel are received at both Bob and Eve. We seek to determine the maximum commitment throughput in the presence of an eavesdropper, i.e., wiretapped commitment capacity, where in addition to the standard security requirements for two-party commitment, one seeks to ensure that Eve doesn't learn about the commit string.
A key interest in this work is to explore the effect of collusion (or lack of it) between the eavesdropper Eve and either Alice or Bob. Toward the same, we present results on the wiretapped commitment capacity under the so-called 1-private regime (when Alice or Bob cannot collude with Eve) and the 2-private regime (when Alice or Bob may possibly collude with Eve). - [325] arXiv:2406.13617 [pdf, html, other]
-
Title: Optimizing Psychological Counseling with Instruction-Tuned Large Language ModelsComments: 9 pagesSubjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
The advent of large language models (LLMs) has significantly advanced various fields, including natural language processing and automated dialogue systems. This paper explores the application of LLMs in psychological counseling, addressing the increasing demand for mental health services. We present a method for instruction tuning LLMs with specialized prompts to enhance their performance in providing empathetic, relevant, and supportive responses. Our approach involves developing a comprehensive dataset of counseling-specific prompts, refining them through feedback from professional counselors, and conducting rigorous evaluations using both automatic metrics and human assessments. The results demonstrate that our instruction-tuned model outperforms several baseline LLMs, highlighting its potential as a scalable and accessible tool for mental health support.
- [326] arXiv:2406.13618 [pdf, html, other]
-
Title: In-Context Former: Lightning-fast Compressing Context for Large Language ModelSubjects: Computation and Language (cs.CL)
With the rising popularity of Transformer-based large language models (LLMs), reducing their high inference costs has become a significant research focus. One effective approach is to compress the long input contexts. Existing methods typically leverage the self-attention mechanism of the LLM itself for context compression. While these methods have achieved notable results, the compression process still involves quadratic time complexity, which limits their applicability. To mitigate this limitation, we propose the In-Context Former (IC-Former). Unlike previous methods, IC-Former does not depend on the target LLMs. Instead, it leverages the cross-attention mechanism and a small number of learnable digest tokens to directly condense information from the contextual word embeddings. This approach significantly reduces inference time, which achieves linear growth in time complexity within the compression range. Experimental results indicate that our method requires only 1/32 of the floating-point operations of the baseline during compression and improves processing speed by 68 to 112 times while achieving over 90% of the baseline performance on evaluation metrics. Overall, our model effectively reduces compression costs and makes real-time compression scenarios feasible.
- [327] arXiv:2406.13620 [pdf, html, other]
-
Title: An Algorithm for the Assignment Game Beyond Additive ValuationsSubjects: Discrete Mathematics (cs.DM); Computer Science and Game Theory (cs.GT)
The assignment game, introduced by Shapley and Shubik (1971), is a classic model for two-sided matching markets between buyers and sellers. In the original assignment game, it is assumed that payments lead to transferable utility and that buyers have unit-demand valuations for the items being sold. Two important and mostly independent lines of work have studied more general settings with imperfectly transferable utility and gross substitutes valuations. Multiple efficient algorithms have been proposed for computing a competitive equilibrium, the standard solution concept in assignment games, in these two settings. Our main result is an efficient algorithm for computing competitive equilibria in a setting with both imperfectly transferable utility and gross substitutes valuations. Our algorithm combines augmenting path techniques from maximum matching and algorithms for matroid intersection. We also show that, in a mild generalization of our model, computing a competitive equilibrium is NP-hard.
- [328] arXiv:2406.13621 [pdf, html, other]
-
Title: Improving Visual Commonsense in Language Models via Multiple Image GenerationSubjects: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Commonsense reasoning is fundamentally based on multimodal knowledge. However, existing large language models (LLMs) are primarily trained using textual data only, limiting their ability to incorporate essential visual information. In contrast, Visual Language Models, which excel at visually-oriented tasks, often fail at non-visual tasks such as basic commonsense reasoning. This divergence highlights a critical challenge - the integration of robust visual understanding with foundational text-based language reasoning. To this end, we introduce a method aimed at enhancing LLMs' visual commonsense. Specifically, our method generates multiple images based on the input text prompt and integrates these into the model's decision-making process by mixing their prediction probabilities. To facilitate multimodal grounded language modeling, we employ a late-fusion layer that combines the projected visual features with the output of a pre-trained LLM conditioned on text only. This late-fusion layer enables predictions based on comprehensive image-text knowledge as well as text only when this is required. We evaluate our approach using several visual commonsense reasoning tasks together with traditional NLP tasks, including common sense reasoning and reading comprehension. Our experimental results demonstrate significant superiority over existing baselines. When applied to recent state-of-the-art LLMs (e.g., Llama3), we observe improvements not only in visual common sense but also in traditional NLP benchmarks. Code and models are available under this https URL.
- [329] arXiv:2406.13622 [pdf, other]
-
Title: A Sound and Complete Substitution Algorithm for Multimode Type Theory: Technical ReportComments: 36 pages, 10 figuresSubjects: Logic in Computer Science (cs.LO)
This is the technical report accompanying the paper "A Sound and Complete Substitution Algorithm for Multimode Type Theory" [Ceulemans, Nuyts and Devriese, 2024]. It contains a full definition of Well-Scoped Multimode Type Theory (WSMTT) in Section 2, including many rules for $\sigma$-equivalence and a description of all rules that have been omitted. Furthermore, we present completeness and soundness proofs of the substitution algorithm in full detail. These can be found in Sections 4 and 5 respectively. In order to make this document relatively self-contained, we also include a description of Substitution-Free Multimode Type Theory (SFMTT) in Section 3.
- [330] arXiv:2406.13625 [pdf, other]
-
Title: Enhance the Image: Super Resolution using Artificial Intelligence in MRIZiyu Li, Zihan Li, Haoxiang Li, Qiuyun Fan, Karla L. Miller, Wenchuan Wu, Akshay S. Chaudhari, Qiyuan TianComments: A book chapter in Machine Learning in MRI: From methods to clinical translation. Copyright may be transferred without notice, after which this version may no longer be accessibleSubjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Medical Physics (physics.med-ph)
This chapter provides an overview of deep learning techniques for improving the spatial resolution of MRI, ranging from convolutional neural networks, generative adversarial networks, to more advanced models including transformers, diffusion models, and implicit neural representations. Our exploration extends beyond the methodologies to scrutinize the impact of super-resolved images on clinical and neuroscientific assessments. We also cover various practical topics such as network architectures, image evaluation metrics, network loss functions, and training data specifics, including downsampling methods for simulating low-resolution images and dataset selection. Finally, we discuss existing challenges and potential future directions regarding the feasibility and reliability of deep learning-based MRI super-resolution, with the aim to facilitate its wider adoption to benefit various clinical and neuroscientific applications.
- [331] arXiv:2406.13626 [pdf, html, other]
-
Title: Fine-Tuning Gemma-7B for Enhanced Sentiment Analysis of Financial News HeadlinesSubjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
In this study, we explore the application of sentiment analysis on financial news headlines to understand investor sentiment. By leveraging Natural Language Processing (NLP) and Large Language Models (LLM), we analyze sentiment from the perspective of retail investors. The FinancialPhraseBank dataset, which contains categorized sentiments of financial news headlines, serves as the basis for our analysis. We fine-tuned several models, including distilbert-base-uncased, Llama, and gemma-7b, to evaluate their effectiveness in sentiment classification. Our experiments demonstrate that the fine-tuned gemma-7b model outperforms others, achieving the highest precision, recall, and F1 score. Specifically, the gemma-7b model showed significant improvements in accuracy after fine-tuning, indicating its robustness in capturing the nuances of financial sentiment. This model can be instrumental in providing market insights, risk management, and aiding investment decisions by accurately predicting the sentiment of financial news. The results highlight the potential of advanced LLMs in transforming how we analyze and interpret financial information, offering a powerful tool for stakeholders in the financial industry.
- [332] arXiv:2406.13627 [pdf, html, other]
-
Title: Can AI be enabled to dynamical downscaling? Training a Latent Diffusion Model to mimic km-scale COSMO-CLM downscaling of ERA5 over ItalyComments: 24 pages, 14 figuresSubjects: Machine Learning (cs.LG); Atmospheric and Oceanic Physics (physics.ao-ph)
Downscaling techniques are one of the most prominent applications of Deep Learning (DL) in Earth System Modeling. A robust DL downscaling model can generate high-resolution fields from coarse-scale numerical model simulations, saving the timely and resourceful applications of regional/local models. Additionally, generative DL models have the potential to provide uncertainty information, by generating ensemble-like scenario pools, a task that is computationally prohibitive for traditional numerical simulations. In this study, we apply a Latent Diffusion Model (LDM) to downscale ERA5 data over Italy up to a resolution of 2 km. The high-resolution target data consists of results from a high-resolution dynamical downscaling performed with COSMO-CLM. Our goal is to demonstrate that recent advancements in generative modeling enable DL-based models to deliver results comparable to those of numerical dynamical downscaling models, given the same input data (i.e., ERA5 data), preserving the realism of fine-scale features and flow characteristics. The training and testing database consists of hourly data from 2000 to 2020. The target variables of this study are 2-m temperature and 10-m horizontal wind components. A selection of predictors from ERA5 is used as input to the LDM, and a residual approach against a reference UNET is leveraged in applying the LDM. The performance of the generative LDM is compared with reference baselines of increasing complexity: quadratic interpolation of ERA5, a UNET, and a Generative Adversarial Network (GAN) built on the same reference UNET. Results highlight the improvements introduced by the LDM architecture and the residual approach over these baselines. The models are evaluated on a yearly test dataset, assessing the models' performance through deterministic metrics, spatial distribution of errors, and reconstruction of frequency and power spectra distributions.
- [333] arXiv:2406.13629 [pdf, html, other]
-
Title: InstructRAG: Instructing Retrieval-Augmented Generation with Explicit DenoisingComments: Code: this https URLSubjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Retrieval-augmented generation (RAG) has shown promising potential to enhance the accuracy and factuality of language models (LMs). However, imperfect retrievers or noisy corpora can introduce misleading or even erroneous information to the retrieved contents, posing a significant challenge to the generation quality. Existing RAG methods typically address this challenge by directly predicting final answers despite potentially noisy inputs, resulting in an implicit denoising process that is difficult to interpret and verify. On the other hand, the acquisition of explicit denoising supervision is often costly, involving significant human efforts. In this work, we propose InstructRAG, where LMs explicitly learn the denoising process through self-synthesized rationales -- First, we instruct the LM to explain how the ground-truth answer is derived from retrieved documents. Then, these rationales can be used either as demonstrations for in-context learning of explicit denoising or as supervised fine-tuning data to train the model. Compared to standard RAG approaches, InstructRAG requires no additional supervision, allows for easier verification of the predicted answers, and effectively improves generation accuracy. Experiments show InstructRAG consistently outperforms existing RAG methods in both training-free and trainable scenarios, achieving a relative improvement of 8.3% over the best baseline method on average across five knowledge-intensive benchmarks. Extensive analysis indicates that InstructRAG scales well with increased numbers of retrieved documents and consistently exhibits robust denoising ability even in out-of-domain datasets, demonstrating strong generalizability.
- [334] arXiv:2406.13631 [pdf, html, other]
-
Title: On AI-Inspired UI-DesignSubjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Graphical User Interface (or simply UI) is a primary mean of interaction between users and their device. In this paper, we discuss three major complementary approaches on how to use Artificial Intelligence (AI) to support app designers create better, more diverse, and creative UI of mobile apps. First, designers can prompt a Large Language Model (LLM) like GPT to directly generate and adjust one or multiple UIs. Second, a Vision-Language Model (VLM) enables designers to effectively search a large screenshot dataset, e.g. from apps published in app stores. The third approach is to train a Diffusion Model (DM) specifically designed to generate app UIs as inspirational images. We discuss how AI should be used, in general, to inspire and assist creative app design rather than automating it.
- [335] arXiv:2406.13632 [pdf, html, other]
-
Title: Can Few-shot Work in Long-Context? Recycling the Context to Generate DemonstrationsArie Cattan, Alon Jacovi, Alex Fabrikant, Jonathan Herzig, Roee Aharoni, Hannah Rashkin, Dror Marcus, Avinatan Hassidim, Yossi Matias, Idan Szpektor, Avi CaciularuSubjects: Computation and Language (cs.CL)
Despite recent advancements in Large Language Models (LLMs), their performance on tasks involving long contexts remains sub-optimal. In-Context Learning (ICL) with few-shot examples may be an appealing solution to enhance LLM performance in this scenario; However, naively adding ICL examples with long context introduces challenges, including substantial token overhead added for each few-shot example and context mismatch between the demonstrations and the target query. In this work, we propose to automatically generate few-shot examples for long context QA tasks by recycling contexts. Specifically, given a long input context (1-3k tokens) and a query, we generate additional query-output pairs from the given context as few-shot examples, while introducing the context only once. This ensures that the demonstrations are leveraging the same context as the target query while only adding a small number of tokens to the prompt. We further enhance each demonstration by instructing the model to explicitly identify the relevant paragraphs before the answer, which improves performance while providing fine-grained attribution to the answer source. We apply our method on multiple LLMs and obtain substantial improvements on various QA datasets with long context, especially when the answer lies within the middle of the context. Surprisingly, despite introducing only single-hop ICL examples, LLMs also successfully generalize to multi-hop long-context QA using our approach.
- [336] arXiv:2406.13633 [pdf, html, other]
-
Title: Reinforcement Learning for Infinite-Horizon Average-Reward MDPs with Multinomial Logistic Function ApproximationSubjects: Machine Learning (cs.LG); Optimization and Control (math.OC)
We study model-based reinforcement learning with non-linear function approximation where the transition function of the underlying Markov decision process (MDP) is given by a multinomial logistic (MNL) model. In this paper, we develop two algorithms for the infinite-horizon average reward setting. Our first algorithm \texttt{UCRL2-MNL} applies to the class of communicating MDPs and achieves an $\tilde{\mathcal{O}}(dD\sqrt{T})$ regret, where $d$ is the dimension of feature mapping, $D$ is the diameter of the underlying MDP, and $T$ is the horizon. The second algorithm \texttt{OVIFH-MNL} is computationally more efficient and applies to the more general class of weakly communicating MDPs, for which we show a regret guarantee of $\tilde{\mathcal{O}}(d^{2/5} \mathrm{sp}(v^*)T^{4/5})$ where $\mathrm{sp}(v^*)$ is the span of the associated optimal bias function.
We also prove a lower bound of $\Omega(d\sqrt{DT})$ for learning communicating MDPs with MNL transitions of diameter at most $D$. Furthermore, we show a regret lower bound of $\Omega(dH^{3/2}\sqrt{K})$ for learning $H$-horizon episodic MDPs with MNL function approximation where $K$ is the number of episodes, which improves upon the best-known lower bound for the finite-horizon setting. - [337] arXiv:2406.13636 [pdf, html, other]
-
Title: Contrast Sets for Evaluating Language-Guided Robot PoliciesSubjects: Robotics (cs.RO); Machine Learning (cs.LG)
Robot evaluations in language-guided, real world settings are time-consuming and often sample only a small space of potential instructions across complex scenes. In this work, we introduce contrast sets for robotics as an approach to make small, but specific, perturbations to otherwise independent, identically distributed (i.i.d.) test instances. We investigate the relationship between experimenter effort to carry out an evaluation and the resulting estimated test performance as well as the insights that can be drawn from performance on perturbed instances. We use contrast sets to characterize policies at reduced experimenter effort in both a simulated manipulation task and a physical robot vision-and-language navigation task. We encourage the use of contrast set evaluations as a more informative alternative to small scale, i.i.d. demonstrations on physical robots, and as a scalable alternative to industry-scale real world evaluations.
- [338] arXiv:2406.13640 [pdf, html, other]
-
Title: Transferable Tactile Transformers for Representation Learning Across Diverse Sensors and TasksSubjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
This paper presents T3: Transferable Tactile Transformers, a framework for tactile representation learning that scales across multi-sensors and multi-tasks. T3 is designed to overcome the contemporary issue that camera-based tactile sensing is extremely heterogeneous, i.e. sensors are built into different form factors, and existing datasets were collected for disparate tasks. T3 captures the shared latent information across different sensor-task pairings by constructing a shared trunk transformer with sensor-specific encoders and task-specific decoders. The pre-training of T3 utilizes a novel Foundation Tactile (FoTa) dataset, which is aggregated from several open-sourced datasets and it contains over 3 million data points gathered from 13 sensors and 11 tasks. FoTa is the largest and most diverse dataset in tactile sensing to date and it is made publicly available in a unified format. Across various sensors and tasks, experiments show that T3 pre-trained with FoTa achieved zero-shot transferability in certain sensor-task pairings, can be further fine-tuned with small amounts of domain-specific data, and its performance scales with bigger network sizes. T3 is also effective as a tactile encoder for long horizon contact-rich manipulation. Results from sub-millimeter multi-pin electronics insertion tasks show that T3 achieved a task success rate 25% higher than that of policies trained with tactile encoders trained from scratch, or 53% higher than without tactile sensing. Data, code, and model checkpoints are open-sourced at this https URL.
- [339] arXiv:2406.13641 [pdf, html, other]
-
Title: Minimalist exploration strategies for robot swarms at the edge of chaosComments: This work has been submitted to the IEEE RA-L for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessibleSubjects: Robotics (cs.RO); Multiagent Systems (cs.MA)
Effective exploration abilities are fundamental for robot swarms, especially when small, inexpensive robots are employed (e.g., micro- or nano-robots). Random walks are often the only viable choice if robots are too constrained regarding sensors and computation to implement state-of-the-art solutions. However, identifying the best random walk parameterisation may not be trivial. Additionally, variability among robots in terms of motion abilities-a very common condition when precise calibration is not possible-introduces the need for flexible solutions. This study explores how random walks that present chaotic or edge-of-chaos dynamics can be generated. We also evaluate their effectiveness for a simple exploration task performed by a swarm of simulated Kilobots. First, we show how Random Boolean Networks can be used as controllers for the Kilobots, achieving a significant performance improvement compared to the best parameterisation of a Lévy-modulated Correlated Random Walk. Second, we demonstrate how chaotic dynamics are beneficial to maximise exploration effectiveness. Finally, we demonstrate how the exploration behavior produced by Boolean Networks can be optimized through an Evolutionary Robotics approach while maintaining the chaotic dynamics of the networks.
- [340] arXiv:2406.13642 [pdf, html, other]
-
Title: SpatialBot: Precise Spatial Understanding with Vision Language ModelsSubjects: Computer Vision and Pattern Recognition (cs.CV)
Vision Language Models (VLMs) have achieved impressive performance in 2D image understanding, however they are still struggling with spatial understanding which is the foundation of Embodied AI. In this paper, we propose SpatialBot for better spatial understanding by feeding both RGB and depth images. Additionally, we have constructed the SpatialQA dataset, which involves multi-level depth-related questions to train VLMs for depth understanding. Finally, we present SpatialBench to comprehensively evaluate VLMs' capabilities in spatial understanding at different levels. Extensive experiments on our spatial-understanding benchmark, general VLM benchmarks and Embodied AI tasks, demonstrate the remarkable improvements of SpatialBot trained on SpatialQA. The model, code and data are available at this https URL.
- [341] arXiv:2406.13644 [pdf, html, other]
-
Title: Kinetic Monte Carlo methods for three-dimensional diffusive capture problems in exterior domainsComments: 32 pages, 10 figuresSubjects: Numerical Analysis (math.NA); Analysis of PDEs (math.AP); Biological Physics (physics.bio-ph); Quantitative Methods (q-bio.QM)
Cellular scale decision making is modulated by the dynamics of signalling molecules and their diffusive trajectories from a source to small absorbing sites on the cellular surface. Diffusive capture problems are computationally challenging due to the complex geometry and the applied boundary conditions together with intrinsically long transients that occur before a particle is captured. This paper reports on a particle-based Kinetic Monte Carlo (KMC) method that provides rapid accurate simulation of arrival statistics for (i) a half-space bounded by a surface with a finite collection of absorbing traps and (ii) the domain exterior to a convex cell again with absorbing traps. We validate our method by replicating classical results and in addition, newly developed boundary homogenization theories and matched asymptotic expansions on capture rates. In the case of non-spherical domains, we describe a new shielding effect in which geometry can play a role in sharpening cellular estimates on the directionality of diffusive sources.
- [342] arXiv:2406.13650 [pdf, html, other]
-
Title: Advanced Maximum Adhesion Tracking Strategies in Railway Traction DrivesAhmed Fathy Abouzeid, Juan Manuel Guerrero, Lander Lejarza, Iker Muniategui, Aitor Endemaño, Fernando BrizComments: 16 pages, 21 figuresJournal-ref: IEEE Transactions on Transportation Electrification, vol. 10, no. 2, pp. 3645-3660, June 2024Subjects: Systems and Control (eess.SY)
Modern railway traction systems are often equipped with anti-slip control strategies to comply with performance and safety requirements. A certain amount of slip is needed to increase the torque transferred by the traction motors onto the rail. Commonly, constant slip control is used to limit the slip velocity between the wheel and rail avoiding excessive slippage and vehicle derailment. This is at the price of not fully utilizing the train's traction and braking capabilities. Finding the slip at which maximum traction force occurs is challenging due to the non-linear relationship between slip and wheel-rail adhesion coefficient, as well as to its dependence on rail and wheel conditions. Perturb and observe (P\&O) and steepest gradient (SG) methods have been reported for the Maximum Adhesion Tracking (MAT) search. However, both methods exhibit weaknesses. Two new MAT strategies are proposed in this paper which overcome the limitations of existing methods, using Fuzzy Logic Controller (FLC) and Particle Swarm Optimization (PSO) respectively. Existing and proposed methods are first simulated and further validated experimentally using a scaled roller rig under identical conditions. The results show that the proposed methods improve the traction capability with lower searching time and oscillations compared to existing solutions. Tuning complexity and computational requirements will also be shown to be favorable to the proposed methods.
- [343] arXiv:2406.13652 [pdf, html, other]
-
Title: Stability and Generalizability in SDE Diffusion Models with Measure-Preserving DynamicsSubjects: Artificial Intelligence (cs.AI)
Inverse problems describe the process of estimating the causal factors from a set of measurements or data. Mapping of often incomplete or degraded data to parameters is ill-posed, thus data-driven iterative solutions are required, for example when reconstructing clean images from poor signals. Diffusion models have shown promise as potent generative tools for solving inverse problems due to their superior reconstruction quality and their compatibility with iterative solvers. However, most existing approaches are limited to linear inverse problems represented as Stochastic Differential Equations (SDEs). This simplification falls short of addressing the challenging nature of real-world problems, leading to amplified cumulative errors and biases. We provide an explanation for this gap through the lens of measure-preserving dynamics of Random Dynamical Systems (RDS) with which we analyse Temporal Distribution Discrepancy and thus introduce a theoretical framework based on RDS for SDE diffusion models. We uncover several strategies that inherently enhance the stability and generalizability of diffusion models for inverse problems and introduce a novel score-based diffusion framework, the \textbf{D}ynamics-aware S\textbf{D}E \textbf{D}iffusion \textbf{G}enerative \textbf{M}odel (D$^3$GM). The \textit{Measure-preserving property} can return the degraded measurement to the original state despite complex degradation with the RDS concept of \textit{stability}. Our extensive experimental results corroborate the effectiveness of D$^3$GM across multiple benchmarks including a prominent application for inverse problems, magnetic resonance imaging. Code and data will be publicly available.
- [344] arXiv:2406.13653 [pdf, html, other]
-
Title: Controlling Forgetting with Test-Time Data in Continual LearningComments: 9 pages, 2 figuresSubjects: Machine Learning (cs.LG)
Foundational vision-language models have shown impressive performance on various downstream tasks. Yet, there is still a pressing need to update these models later as new tasks or domains become available. Ongoing Continual Learning (CL) research provides techniques to overcome catastrophic forgetting of previous information when new knowledge is acquired. To date, CL techniques focus only on the supervised training sessions. This results in significant forgetting yielding inferior performance to even the prior model zero shot performance. In this work, we argue that test-time data hold great information that can be leveraged in a self supervised manner to refresh the model's memory of previous learned tasks and hence greatly reduce forgetting at no extra labelling cost. We study how unsupervised data can be employed online to improve models' performance on prior tasks upon encountering representative samples. We propose a simple yet effective student-teacher model with gradient based sparse parameters updates and show significant performance improvements and reduction in forgetting, which could alleviate the role of an offline episodic memory/experience replay buffer.
- [345] arXiv:2406.13655 [pdf, html, other]
-
Title: Improving GFlowNets with Monte Carlo Tree SearchComments: ICML 2024 SPIGM WorkshopSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Generative Flow Networks (GFlowNets) treat sampling from distributions over compositional discrete spaces as a sequential decision-making problem, training a stochastic policy to construct objects step by step. Recent studies have revealed strong connections between GFlowNets and entropy-regularized reinforcement learning. Building on these insights, we propose to enhance planning capabilities of GFlowNets by applying Monte Carlo Tree Search (MCTS). Specifically, we show how the MENTS algorithm (Xiao et al., 2019) can be adapted for GFlowNets and used during both training and inference. Our experiments demonstrate that this approach improves the sample efficiency of GFlowNet training and the generation fidelity of pre-trained GFlowNet models.
- [346] arXiv:2406.13656 [pdf, html, other]
-
Title: Tactical Game-theoretic Decision-making with Homotopy Class ConstraintsSubjects: Multiagent Systems (cs.MA); Robotics (cs.RO)
We propose a tactical homotopy-aware decision-making framework for game-theoretic motion planning in urban environments. We model urban driving as a generalized Nash equilibrium problem and employ a mixed-integer approach to tame the combinatorial aspect of motion planning. More specifically, by utilizing homotopy classes, we partition the high-dimensional solution space into finite, well-defined subregions. Each subregion (homotopy) corresponds to a high-level tactical decision, such as the passing order between pairs of players. The proposed formulation allows to find global optimal Nash equilibria in a computationally tractable manner by solving a mixed-integer quadratic program. Each homotopy decision is represented by a binary variable that activates different sets of linear collision avoidance constraints. This extra homotopic constraint allows to find solutions in a more efficient way (on a roundabout scenario on average 5-times faster). We experimentally validate the proposed approach on scenarios taken from the rounD dataset. Simulation-based testing in receding horizon fashion demonstrates the capability of the framework in achieving globally optimal solutions while yielding a 78% average decrease in the computational time with respect to an implementation without the homotopic constraints.
- [347] arXiv:2406.13657 [pdf, html, other]
-
Title: The strength of the dominance ruleComments: To appear in the proceedings of the 27th International Conference on Theory and Applications of Satisfiability Testing (SAT 2024)Subjects: Logic in Computer Science (cs.LO); Logic (math.LO)
It has become standard that, when a SAT solver decides that a CNF $\Gamma$ is unsatisfiable, it produces a certificate of unsatisfiability in the form of a refutation of $\Gamma$ in some proof system. The system typically used is DRAT, which is equivalent to extended resolution (ER) -- for example, until this year DRAT refutations were required in the annual SAT competition. Recently [Bogaerts et al.~2023] introduced a new proof system, associated with the tool VeriPB, which is at least as strong as DRAT and is further able to handle certain symmetry-breaking techniques. We show that this system simulates the proof system $G_1$, which allows limited reasoning with QBFs and forms the first level above ER in a natural hierarchy of proof systems. This hierarchy is not known to be strict, but nevertheless this is evidence that the system of [Bogaerts et al. 2023] is plausibly strictly stronger than ER and DRAT. In the other direction, we show that symmetry-breaking for a single symmetry can be handled inside ER.
- [348] arXiv:2406.13659 [pdf, html, other]
-
Title: Leveraging Large Language Models for Patient Engagement: The Power of Conversational AI in Digital HealthComments: 10 pages, 6 figures, ICDH 2024 invited paperSubjects: Artificial Intelligence (cs.AI)
The rapid advancements in large language models (LLMs) have opened up new opportunities for transforming patient engagement in healthcare through conversational AI. This paper presents an overview of the current landscape of LLMs in healthcare, specifically focusing on their applications in analyzing and generating conversations for improved patient engagement. We showcase the power of LLMs in handling unstructured conversational data through four case studies: (1) analyzing mental health discussions on Reddit, (2) developing a personalized chatbot for cognitive engagement in seniors, (3) summarizing medical conversation datasets, and (4) designing an AI-powered patient engagement system. These case studies demonstrate how LLMs can effectively extract insights and summarizations from unstructured dialogues and engage patients in guided, goal-oriented conversations. Leveraging LLMs for conversational analysis and generation opens new doors for many patient-centered outcomes research opportunities. However, integrating LLMs into healthcare raises important ethical considerations regarding data privacy, bias, transparency, and regulatory compliance. We discuss best practices and guidelines for the responsible development and deployment of LLMs in healthcare settings. Realizing the full potential of LLMs in digital health will require close collaboration between the AI and healthcare professionals communities to address technical challenges and ensure these powerful tools' safety, efficacy, and equity.
- [349] arXiv:2406.13660 [pdf, html, other]
-
Title: Towards Minimal Targeted Updates of Language Models with Targeted Negative TrainingComments: Published in Transactions of Machine Learning ResearchSubjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Generative models of language exhibit impressive capabilities but still place non-negligible probability mass over undesirable outputs. In this work, we address the task of updating a model to avoid unwanted outputs while minimally changing model behavior otherwise, a challenge we refer to as a minimal targeted update. We first formalize the notion of a minimal targeted update and propose a method to achieve such updates using negative examples from a model's generations. Our proposed Targeted Negative Training (TNT) results in updates that keep the new distribution close to the original, unlike existing losses for negative signal which push down probability but do not control what the updated distribution will be. In experiments, we demonstrate that TNT yields a better trade-off between reducing unwanted behavior and maintaining model generation behavior than baselines, paving the way towards a modeling paradigm based on iterative training updates that constrain models from generating undesirable outputs while preserving their impressive capabilities.
- [350] arXiv:2406.13661 [pdf, html, other]
-
Title: Hitchhiker's guide on Energy-Based Models: a comprehensive review on the relation with other generative models, sampling and statistical physicsDavide Carbone (1 and 2) ((1) Dipartimento di Scienze Matematiche, Politecnico di Torino, Torino, Italy, (2) INFN, Sezione di Torino, Torino, Italy)Subjects: Machine Learning (cs.LG); Mathematical Physics (math-ph); Applied Physics (physics.app-ph); Data Analysis, Statistics and Probability (physics.data-an)
Energy-Based Models (EBMs) have emerged as a powerful framework in the realm of generative modeling, offering a unique perspective that aligns closely with principles of statistical mechanics. This review aims to provide physicists with a comprehensive understanding of EBMs, delineating their connection to other generative models such as Generative Adversarial Networks (GANs), Variational Autoencoders (VAEs), and Normalizing Flows. We explore the sampling techniques crucial for EBMs, including Markov Chain Monte Carlo (MCMC) methods, and draw parallels between EBM concepts and statistical mechanics, highlighting the significance of energy functions and partition functions. Furthermore, we delve into state-of-the-art training methodologies for EBMs, covering recent advancements and their implications for enhanced model performance and efficiency. This review is designed to clarify the often complex interconnections between these models, which can be challenging due to the diverse communities working on the topic.
- [351] arXiv:2406.13662 [pdf, html, other]
-
Title: ObscurePrompt: Jailbreaking Large Language Models via Obscure InputSubjects: Computation and Language (cs.CL)
Recently, Large Language Models (LLMs) have garnered significant attention for their exceptional natural language processing capabilities. However, concerns about their trustworthiness remain unresolved, particularly in addressing "jailbreaking" attacks on aligned LLMs. Previous research predominantly relies on scenarios with white-box LLMs or specific and fixed prompt templates, which are often impractical and lack broad applicability. In this paper, we introduce a straightforward and novel method, named ObscurePrompt, for jailbreaking LLMs, inspired by the observed fragile alignments in Out-of-Distribution (OOD) data. Specifically, we first formulate the decision boundary in the jailbreaking process and then explore how obscure text affects LLM's ethical decision boundary. ObscurePrompt starts with constructing a base prompt that integrates well-known jailbreaking techniques. Powerful LLMs are then utilized to obscure the original prompt through iterative transformations, aiming to bolster the attack's robustness. Comprehensive experiments show that our approach substantially improves upon previous methods in terms of attack effectiveness, maintaining efficacy against two prevalent defense mechanisms. We believe that our work can offer fresh insights for future research on enhancing LLM alignment.
- [352] arXiv:2406.13663 [pdf, html, other]
-
Title: Model Internals-based Answer Attribution for Trustworthy Retrieval-Augmented GenerationComments: Under review. Code and data released at this https URLSubjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Ensuring the verifiability of model answers is a fundamental challenge for retrieval-augmented generation (RAG) in the question answering (QA) domain. Recently, self-citation prompting was proposed to make large language models (LLMs) generate citations to supporting documents along with their answers. However, self-citing LLMs often struggle to match the required format, refer to non-existent sources, and fail to faithfully reflect LLMs' context usage throughout the generation. In this work, we present MIRAGE --Model Internals-based RAG Explanations -- a plug-and-play approach using model internals for faithful answer attribution in RAG applications. MIRAGE detects context-sensitive answer tokens and pairs them with retrieved documents contributing to their prediction via saliency methods. We evaluate our proposed approach on a multilingual extractive QA dataset, finding high agreement with human answer attribution. On open-ended QA, MIRAGE achieves citation quality and efficiency comparable to self-citation while also allowing for a finer-grained control of attribution parameters. Our qualitative evaluation highlights the faithfulness of MIRAGE's attributions and underscores the promising application of model internals for RAG answer attribution.
- [353] arXiv:2406.13664 [pdf, html, other]
-
Title: Root-KGD: A Novel Framework for Root Cause Diagnosis Based on Knowledge Graph and Industrial DataSubjects: Artificial Intelligence (cs.AI)
With the development of intelligent manufacturing and the increasing complexity of industrial production, root cause diagnosis has gradually become an important research direction in the field of industrial fault diagnosis. However, existing research methods struggle to effectively combine domain knowledge and industrial data, failing to provide accurate, online, and reliable root cause diagnosis results for industrial processes. To address these issues, a novel fault root cause diagnosis framework based on knowledge graph and industrial data, called Root-KGD, is proposed. Root-KGD uses the knowledge graph to represent domain knowledge and employs data-driven modeling to extract fault features from industrial data. It then combines the knowledge graph and data features to perform knowledge graph reasoning for root cause identification. The performance of the proposed method is validated using two industrial process cases, Tennessee Eastman Process (TEP) and Multiphase Flow Facility (MFF). Compared to existing methods, Root-KGD not only gives more accurate root cause variable diagnosis results but also provides interpretable fault-related information by locating faults to corresponding physical entities in knowledge graph (such as devices and streams). In addition, combined with its lightweight nature, Root-KGD is more effective in online industrial applications.
- [354] arXiv:2406.13665 [pdf, html, other]
-
Title: Challenges in Binary ClassificationSubjects: Machine Learning (cs.LG)
Binary Classification plays an important role in machine learning. For linear classification, SVM is the optimal binary classification method. For nonlinear classification, the SVM algorithm needs to complete the classification task by using the kernel function. Although the SVM algorithm with kernel function is very effective, the selection of kernel function is empirical, which means that the kernel function may not be optimal. Therefore, it is worth studying how to obtain an optimal binary classifier.
In this paper, the problem of finding the optimal binary classifier is considered as a variational problem. We design the objective function of this variational problem through the max-min problem of the (Euclidean) distance between two classes. For linear classification, it can be deduced that SVM is a special case of this variational problem framework. For Euclidean distance, it is proved that the proposed variational problem has some limitations for nonlinear classification. Therefore, how to design a more appropriate objective function to find the optimal binary classifier is still an open problem. Further, it's discussed some challenges and problems in finding the optimal classifier. - [355] arXiv:2406.13668 [pdf, html, other]
-
Title: Improved bounds for calibration via stronger sign preservation gamesYuval Dagan, Constantinos Daskalakis, Maxwell Fishelson, Noah Golowich, Robert Kleinberg, Princewill OkoroaforSubjects: Machine Learning (cs.LG); Data Structures and Algorithms (cs.DS); Machine Learning (stat.ML)
A set of probabilistic forecasts is calibrated if each prediction of the forecaster closely approximates the empirical distribution of outcomes on the subset of timesteps where that prediction was made. We study the fundamental problem of online calibrated forecasting of binary sequences, which was initially studied by Foster & Vohra (1998). They derived an algorithm with $O(T^{2/3})$ calibration error after $T$ time steps, and showed a lower bound of $\Omega(T^{1/2})$. These bounds remained stagnant for two decades, until Qiao & Valiant (2021) improved the lower bound to $\Omega(T^{0.528})$ by introducing a combinatorial game called sign preservation and showing that lower bounds for this game imply lower bounds for calibration.
We introduce a strengthening of Qiao & Valiant's game that we call sign preservation with reuse (SPR). We prove that the relationship between SPR and calibrated forecasting is bidirectional: not only do lower bounds for SPR translate into lower bounds for calibration, but algorithms for SPR also translate into new algorithms for calibrated forecasting. In particular, any strategy that improves the trivial upper bound for the value of the SPR game would imply a forecasting algorithm with calibration error exponent less than 2/3, improving Foster & Vohra's upper bound for the first time. Using similar ideas, we then prove a slightly stronger lower bound than that of Qiao & Valiant, namely $\Omega(T^{0.54389})$. Our lower bound is obtained by an oblivious adversary, marking the first $\omega(T^{1/2})$ calibration lower bound for oblivious adversaries. - [356] arXiv:2406.13672 [pdf, html, other]
-
Title: Q-SNNs: Quantized Spiking Neural NetworksWenjie Wei, Yu Liang, Ammar Belatreche, Yichen Xiao, Honglin Cao, Zhenbang Ren, Guoqing Wang, Malu Zhang, Yang YangComments: 8 pages, 5 figuresSubjects: Computer Vision and Pattern Recognition (cs.CV)
Brain-inspired Spiking Neural Networks (SNNs) leverage sparse spikes to represent information and process them in an asynchronous event-driven manner, offering an energy-efficient paradigm for the next generation of machine intelligence. However, the current focus within the SNN community prioritizes accuracy optimization through the development of large-scale models, limiting their viability in resource-constrained and low-power edge devices. To address this challenge, we introduce a lightweight and hardware-friendly Quantized SNN (Q-SNN) that applies quantization to both synaptic weights and membrane potentials. By significantly compressing these two key elements, the proposed Q-SNNs substantially reduce both memory usage and computational complexity. Moreover, to prevent the performance degradation caused by this compression, we present a new Weight-Spike Dual Regulation (WS-DR) method inspired by information entropy theory. Experimental evaluations on various datasets, including static and neuromorphic, demonstrate that our Q-SNNs outperform existing methods in terms of both model size and accuracy. These state-of-the-art results in efficiency and efficacy suggest that the proposed method can significantly improve edge intelligent computing.
- [357] arXiv:2406.13675 [pdf, html, other]
-
Title: AI-Assisted Dynamic Port and Waveform Switching for Enhancing UL Coverage in 5G NRAlejandro Villena-Rodríguez, Gerardo Gómez, Mari Carmen Aguayo-Torres, Francisco J. Martín-Vega, José Outes-Carnero, F. Yak Ng-Molina, Juan Ramiro-MorenoComments: 5 pages, 3 figures, 4 tablesSubjects: Information Theory (cs.IT)
The uplink of 5G networks allows selecting the transmit waveform between cyclic prefix orthogonal frequency division multiplexing (CP-OFDM) and discrete Fourier transform spread OFDM (DFT-S-OFDM), which is appealing for cell-edge users using high-frequency bands, since it shows a smaller peak-to-average power ratio, and allows a higher transmit power. Nevertheless, DFT-S-OFDM exhibits a higher block error rate (BLER) which complicates an optimal waveform selection. In this paper, we propose an intelligent waveform-switching mechanism based on deep reinforcement learning (DRL). In this proposal, a learning agent aims at maximizing a function built using available throughput percentiles in real networks. Said percentiles are weighted so as to improve the cell-edge users' service without dramatically reducing the cell average. Aggregated measurements of signal-to-noise ratio (SNR) and timing advance (TA), available in real networks, are used in the procedure. Results show that our proposed scheme greatly outperforms both metrics compared to classical approaches.
- [358] arXiv:2406.13677 [pdf, html, other]
-
Title: Leveraging Large Language Models to Measure Gender Bias in Gendered LanguagesSubjects: Computation and Language (cs.CL); Computers and Society (cs.CY)
Gender bias in text corpora used in various natural language processing (NLP) contexts, such as for training large language models (LLMs), can lead to the perpetuation and amplification of societal inequalities. This is particularly pronounced in gendered languages like Spanish or French, where grammatical structures inherently encode gender, making the bias analysis more challenging. Existing methods designed for English are inadequate for this task due to the intrinsic linguistic differences between English and gendered languages. This paper introduces a novel methodology that leverages the contextual understanding capabilities of LLMs to quantitatively analyze gender representation in Spanish corpora. By utilizing LLMs to identify and classify gendered nouns and pronouns in relation to their reference to human entities, our approach provides a nuanced analysis of gender biases. We empirically validate our method on four widely-used benchmark datasets, uncovering significant gender disparities with a male-to-female ratio ranging from 4:1 to 6:1. These findings demonstrate the value of our methodology for bias quantification in gendered languages and suggest its application in NLP, contributing to the development of more equitable language technologies.
- [359] arXiv:2406.13679 [pdf, html, other]
-
Title: Prose-to-P4: Leveraging High Level LanguagesSubjects: Networking and Internet Architecture (cs.NI); Machine Learning (cs.LG)
Languages such as P4 and NPL have enabled a wide and diverse range of networking applications that take advantage of programmable dataplanes. However, software development in these languages is difficult. To address this issue, high-level languages have been designed to offer programmers powerful abstractions that reduce the time, effort and domain-knowledge required for developing networking applications. These languages are then translated by a compiler into P4/NPL code. Inspired by the recent success of Large Language Models (LLMs) in the task of code generation, we propose to raise the level of abstraction even higher, employing LLMs to translate prose into high-level networking code. We analyze the problem, focusing on the motivation and opportunities, as well as the challenges involved and sketch out a roadmap for the development of a system that can generate high-level dataplane code from natural language instructions. We present some promising preliminary results on generating Lucid code from natural language.
- [360] arXiv:2406.13681 [pdf, other]
-
Title: On the Consistency of Fairness Measurement Methods for Regression TasksComments: Accepted at ijcai24 - The Trustworthy AI WorkshopSubjects: Machine Learning (cs.LG)
With growing applications of Machine Learning (ML) techniques in the real world, it is highly important to ensure that these models work in an equitable manner. One main step in ensuring fairness is to effectively measure fairness, and to this end, various metrics have been proposed in the past literature. While the computation of those metrics are straightforward in the classification set-up, it is computationally intractable in the regression domain. To address the challenge of computational intractability, past literature proposed various methods to approximate such metrics. However, they did not verify the extent to which the output of such approximation algorithms are consistent with each other. To fill this gap, this paper comprehensively studies the consistency of the output of various fairness measurement methods through conducting an extensive set of experiments on various regression tasks. As a result, it finds that while some fairness measurement approaches show strong consistency across various regression tasks, certain methods show a relatively poor consistency in certain regression tasks. This, in turn, calls for a more principled approach for measuring fairness in the regression domain.
- [361] arXiv:2406.13683 [pdf, html, other]
-
Title: IntCoOp: Interpretability-Aware Vision-Language Prompt TuningSubjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Image-text contrastive models such as CLIP learn transferable and robust representations for zero-shot transfer to a variety of downstream tasks. However, to obtain strong downstream performances, prompts need to be carefully curated, which can be a tedious engineering task. To address the issue of manual prompt engineering, prompt-tuning is used where a set of contextual vectors are learned by leveraging information from the training data. Despite their effectiveness, existing prompt-tuning frameworks often lack interpretability, thus limiting their ability to understand the compositional nature of images. In this work, we first identify that incorporating compositional attributes (e.g., a "green" tree frog) in the design of manual prompts can significantly enhance image-text alignment scores. Building upon this observation, we propose a novel and interpretable prompt-tuning method named IntCoOp, which learns to jointly align attribute-level inductive biases and class embeddings during prompt-tuning. To assess the effectiveness of our approach, we evaluate IntCoOp across two representative tasks in a few-shot learning setup: generalization to novel classes, and unseen domain shifts. Through extensive experiments across 10 downstream datasets on CLIP, we find that introducing attribute-level inductive biases leads to superior performance against state-of-the-art prompt tuning frameworks. Notably, in a 16-shot setup, IntCoOp improves CoOp by 7.35% in average performance across 10 diverse datasets.
- [362] arXiv:2406.13688 [pdf, html, other]
-
Title: Development of a Dual-Input Neural Model for Detecting AI-Generated ImagerySubjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Over the past years, images generated by artificial intelligence have become more prevalent and more realistic. Their advent raises ethical questions relating to misinformation, artistic expression, and identity theft, among others. The crux of many of these moral questions is the difficulty in distinguishing between real and fake images. It is important to develop tools that are able to detect AI-generated images, especially when these images are too realistic-looking for the human eye to identify as fake. This paper proposes a dual-branch neural network architecture that takes both images and their Fourier frequency decomposition as inputs. We use standard CNN-based methods for both branches as described in Stuchi et al. [7], followed by fully-connected layers. Our proposed model achieves an accuracy of 94% on the CIFAKE dataset, which significantly outperforms classic ML methods and CNNs, achieving performance comparable to some state-of-the-art architectures, such as ResNet.
- [363] arXiv:2406.13692 [pdf, html, other]
-
Title: Synchronous Faithfulness Monitoring for Trustworthy Retrieval-Augmented GenerationSubjects: Computation and Language (cs.CL)
Retrieval-augmented language models (RALMs) have shown strong performance and wide applicability in knowledge-intensive tasks. However, there are significant trustworthiness concerns as RALMs are prone to generating unfaithful outputs, including baseless information or contradictions with the retrieved context. This paper proposes SynCheck, a lightweight monitor that leverages fine-grained decoding dynamics including sequence likelihood, uncertainty quantification, context influence, and semantic alignment to synchronously detect unfaithful sentences. By integrating efficiently measurable and complementary signals, SynCheck enables accurate and immediate feedback and intervention, achieving 0.85 AUROC in detecting faithfulness errors across six long-form retrieval-augmented generation tasks, improving prior best method by 4%. Leveraging SynCheck, we further introduce FOD, a faithfulness-oriented decoding algorithm guided by beam search for long-form retrieval-augmented generation. Empirical results demonstrate that FOD outperforms traditional strategies such as abstention, reranking, or contrastive decoding significantly in terms of faithfulness, achieving over 10% improvement across six datasets.
- [364] arXiv:2406.13693 [pdf, html, other]
-
Title: From Single Agent to Multi-Agent: Improving Traffic Signal ControlComments: 13 pagesSubjects: Artificial Intelligence (cs.AI)
Due to accelerating urbanization, the importance of solving the signal control problem increases. This paper analyzes various existing methods and suggests options for increasing the number of agents to reduce the average travel time. Experiments were carried out with 2 datasets. The results show that in some cases, the implementation of multiple agents can improve existing methods. For a fine-tuned large language model approach there is small enhancement on all metrics.
- [365] arXiv:2406.13694 [pdf, html, other]
-
Title: An Embedded Intelligent System for Attendance MonitoringComments: 18 pages, 6 figuresSubjects: Artificial Intelligence (cs.AI)
In this paper, we propose an intelligent embedded system for monitoring class attendance and sending the attendance list to a remote computer. The proposed system consists of two parts : an embedded device (Raspberry with PI camera) for facial recognition and a web application for attendance management. The proposed solution take into account the different challenges: the limited resources of the Raspberry Pi, the need to adapt the facial recognition model and achieving acceptable performance using images provided by the Raspberry Pi camera.
- [366] arXiv:2406.13695 [pdf, other]
-
Title: Multilingual De-Duplication Strategies: Applying scalable similarity search with monolingual & multilingual embedding modelsComments: 6 pages, 3 figuresSubjects: Artificial Intelligence (cs.AI)
This paper addresses the deduplication of multilingual textual data using advanced NLP tools. We compare a two-step method involving translation to English followed by embedding with mpnet, and a multilingual embedding model (distiluse). The two-step approach achieved a higher F1 score (82% vs. 60%), particularly with less widely used languages, which can be increased up to 89% by leveraging expert rules based on domain knowledge. We also highlight limitations related to token length constraints and computational efficiency. Our methodology suggests improvements for future multilingual deduplication tasks.
- [367] arXiv:2406.13698 [pdf, html, other]
-
Title: MMTE: Corpus and Metrics for Evaluating Machine Translation Quality of Metaphorical LanguageSubjects: Computation and Language (cs.CL)
Machine Translation (MT) has developed rapidly since the release of Large Language Models and current MT evaluation is performed through comparison with reference human translations or by predicting quality scores from human-labeled data. However, these mainstream evaluation methods mainly focus on fluency and factual reliability, whilst paying little attention to figurative quality. In this paper, we investigate the figurative quality of MT and propose a set of human evaluation metrics focused on the translation of figurative language. We additionally present a multilingual parallel metaphor corpus generated by post-editing. Our evaluation protocol is designed to estimate four aspects of MT: Metaphorical Equivalence, Emotion, Authenticity, and Quality. In doing so, we observe that translations of figurative expressions display different traits from literal ones.
- [368] arXiv:2406.13700 [pdf, html, other]
-
Title: Reinforcement Learning-Based Model Matching to Reduce the Sim-Real Gap in COBRASubjects: Robotics (cs.RO)
This paper employs a reinforcement learning-based model identification method aimed at enhancing the accuracy of the dynamics for our snake robot, called COBRA. Leveraging gradient information and iterative optimization, the proposed approach refines the parameters of COBRA's dynamical model such as coefficient of friction and actuator parameters using experimental and simulated data. Experimental validation on the hardware platform demonstrates the efficacy of the proposed approach, highlighting its potential to address sim-to-real gap in robot implementation.
- [369] arXiv:2406.13706 [pdf, html, other]
-
Title: Breaking News: Case Studies of Generative AI's Use in JournalismSubjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Journalists are among the many users of large language models (LLMs). To better understand the journalist-AI interactions, we conduct a study of LLM usage by two news agencies through browsing the WildChat dataset, identifying candidate interactions, and verifying them by matching to online published articles. Our analysis uncovers instances where journalists provide sensitive material such as confidential correspondence with sources or articles from other agencies to the LLM as stimuli and prompt it to generate articles, and publish these machine-generated articles with limited intervention (median output-publication ROUGE-L of 0.62). Based on our findings, we call for further research into what constitutes responsible use of AI, and the establishment of clear guidelines and best practices on using LLMs in a journalistic context.
- [370] arXiv:2406.13707 [pdf, html, other]
-
Title: Safety-Critical Formation Control of Non-Holonomic Multi-Robot Systems in Communication-Limited EnvironmentsComments: Under reviewSubjects: Systems and Control (eess.SY); Robotics (cs.RO)
This paper presents a robust estimator-based safety-critical controller for formation control of non-holonomic mobile robots in communication-limited environments. The proposed decentralized framework integrates a robust state estimator with a formation tracking control law that guarantees inter-agent collision avoidance using control barrier functions. String stability is incorporated into the control design to maintain stability against noise from predecessors in leader-follower formations. Rigorous stability analysis using Lyapunov functions ensures the stability of estimation errors and the convergence of the formation to desired configurations. The effectiveness and robustness of the proposed approach are validated through numerical simulations of various maneuvers and realistic Gazebo experiments involving formations in a warehouse environment. The results demonstrate the controller's ability to maintain safety, achieve precise formation control, and mitigate disturbances in scenarios without inter-robot communication.
- [371] arXiv:2406.13711 [pdf, html, other]
-
Title: Imagining In-distribution States: How Predictable Robot Behavior Can Enable User Control Over Learned PoliciesComments: Accepted to IEEE RO-MAN 2024 as a regular paper. arXiv admin note: substantial text overlap with arXiv:2312.05991Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
It is crucial that users are empowered to take advantage of the functionality of a robot and use their understanding of that functionality to perform novel and creative tasks. Given a robot trained with Reinforcement Learning (RL), a user may wish to leverage that autonomy along with their familiarity of how they expect the robot to behave to collaborate with the robot. One technique is for the user to take control of some of the robot's action space through teleoperation, allowing the RL policy to simultaneously control the rest. We formalize this type of shared control as Partitioned Control (PC). However, this may not be possible using an out-of-the-box RL policy. For example, a user's control may bring the robot into a failure state from the policy's perspective, causing it to act unexpectedly and hindering the success of the user's desired task. In this work, we formalize this problem and present Imaginary Out-of-Distribution Actions, IODA, an initial algorithm which empowers users to leverage their expectations of a robot's behavior to accomplish new tasks. We deploy IODA in a user study with a real robot and find that IODA leads to both better task performance and a higher degree of alignment between robot behavior and user expectation. We also show that in PC, there is a strong and significant correlation between task performance and the robot's ability to meet user expectations, highlighting the need for approaches like IODA. Code is available at this https URL
- [372] arXiv:2406.13712 [pdf, html, other]
-
Title: Convex-hull Estimation using XPSNR for Versatile Video CodingComments: Accepted at 2024 IEEE International Conference on Image Processing (ICIP)Subjects: Multimedia (cs.MM); Image and Video Processing (eess.IV)
As adaptive streaming becomes crucial for delivering high-quality video content across diverse network conditions, accurate metrics to assess perceptual quality are essential. This paper explores using the eXtended Peak Signal-to-Noise Ratio (XPSNR) metric as an alternative to the popular Video Multimethod Assessment Fusion (VMAF) metric for determining optimized bitrate-resolution pairs in the context of Versatile Video Coding (VVC). Our study is rooted in the observation that XPSNR shows a superior correlation with subjective quality scores for VVC-coded Ultra-High Definition (UHD) content compared to VMAF. We predict the average XPSNR of VVC-coded bitstreams using spatiotemporal complexity features of the video and the target encoding configuration and then determine the convex-hull online. On average, the proposed convex-hull using XPSNR (VEXUS) achieves an overall quality improvement of 5.84 dB PSNR and 0.62 dB XPSNR while maintaining the same bitrate, compared to the default UHD encoding using the VVenC encoder, accompanied by an encoding time reduction of 44.43% and a decoding time reduction of 65.46%. This shift towards XPSNR as a guiding metric shall enhance the effectiveness of adaptive streaming algorithms, ensuring an optimal balance between bitrate efficiency and perceptual fidelity with advanced video coding standards.
- [373] arXiv:2406.13713 [pdf, html, other]
-
Title: Benchmarking Open-Source Language Models for Efficient Question Answering in Industrial ApplicationsMahaman Sanoussi Yahaya Alassan, Jessica López Espejel, Merieme Bouhandi, Walid Dahhane, El Hassane EttifouriComments: Preprint submitted to Engineering Applications of Artificial IntelligenceSubjects: Computation and Language (cs.CL)
In the rapidly evolving landscape of Natural Language Processing (NLP), Large Language Models (LLMs) have demonstrated remarkable capabilities in tasks such as question answering (QA). However, the accessibility and practicality of utilizing these models for industrial applications pose significant challenges, particularly concerning cost-effectiveness, inference speed, and resource efficiency. This paper presents a comprehensive benchmarking study comparing open-source LLMs with their non-open-source counterparts on the task of question answering. Our objective is to identify open-source alternatives capable of delivering comparable performance to proprietary models while being lightweight in terms of resource requirements and suitable for Central Processing Unit (CPU)-based inference. Through rigorous evaluation across various metrics including accuracy, inference speed, and resource consumption, we aim to provide insights into selecting efficient LLMs for real-world applications. Our findings shed light on viable open-source alternatives that offer acceptable performance and efficiency, addressing the pressing need for accessible and efficient NLP solutions in industry settings.
- [374] arXiv:2406.13714 [pdf, html, other]
-
Title: BEACON: Balancing Convenience and Nutrition in Meals With Long-Term Group Recommendations and Reasoning on Multimodal RecipesComments: 6 pages (including references), 1 figure, 2 tablesSubjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
A common, yet regular, decision made by people, whether healthy or with any health condition, is to decide what to have in meals like breakfast, lunch, and dinner, consisting of a combination of foods for appetizer, main course, side dishes, desserts, and beverages. However, often this decision is seen as a trade-off between nutritious choices (e.g., low salt and sugar) or convenience (e.g., inexpensive, fast to prepare/obtain, taste better). In this preliminary work, we present a data-driven approach for the novel meal recommendation problem that can explore and balance choices for both considerations while also reasoning about a food's constituents and cooking process. Beyond the problem formulation, our contributions also include a goodness measure, a recipe conversion method from text to the recently introduced multimodal rich recipe representation (R3) format, and learning methods using contextual bandits that show promising results.
- [375] arXiv:2406.13715 [pdf, html, other]
-
Title: Converging Dimensions: Information Extraction and Summarization through Multisource, Multimodal, and Multilingual FusionComments: 11 pages, 3 figuresSubjects: Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Recent advances in large language models (LLMs) have led to new summarization strategies, offering an extensive toolkit for extracting important information. However, these approaches are frequently limited by their reliance on isolated sources of data. The amount of information that can be gathered is limited and covers a smaller range of themes, which introduces the possibility of falsified content and limited support for multilingual and multimodal data. The paper proposes a novel approach to summarization that tackles such challenges by utilizing the strength of multiple sources to deliver a more exhaustive and informative understanding of intricate topics. The research progresses beyond conventional, unimodal sources such as text documents and integrates a more diverse range of data, including YouTube playlists, pre-prints, and Wikipedia pages. The aforementioned varied sources are then converted into a unified textual representation, enabling a more holistic analysis. This multifaceted approach to summary generation empowers us to extract pertinent information from a wider array of sources. The primary tenet of this approach is to maximize information gain while minimizing information overlap and maintaining a high level of informativeness, which encourages the generation of highly coherent summaries.
- [376] arXiv:2406.13716 [pdf, other]
-
Title: MultiChor: Census Polymorphic Choreographic Programming with Multiply Located ValuesSubjects: Programming Languages (cs.PL); Distributed, Parallel, and Cluster Computing (cs.DC)
Choreographic programming is a concurrent paradigm in which a single global program called a choreography describes behavior across an entire distributed network of participants. Choreographies are easier to reason about than separate programs running in parallel, and choreographic programming systems can check for deadlocks statically.
We present MultiChor, a library for writing and running choreographies as monadic values in Haskell. Unlike prior Haskell implementations, MultiChor does not require excess communication to handle Knowledge-of-Choice. Unlike all prior general-purpose choreographic languages, MultiChor can express choreographies that are polymorphic over the number of participants. - [377] arXiv:2406.13718 [pdf, html, other]
-
Title: Evaluating Large Language Models along Dimensions of Language Variation: A Systematik Invesdigatiom uv Cross-lingual GeneralizationComments: 9 pagesSubjects: Computation and Language (cs.CL)
While large language models exhibit certain cross-lingual generalization capabilities, they suffer from performance degradation (PD) on unseen closely-related languages (CRLs) and dialects relative to their high-resource language neighbour (HRLN). However, we currently lack a fundamental understanding of what kinds of linguistic distances contribute to PD, and to what extent. Furthermore, studies of cross-lingual generalization are confounded by unknown quantities of CRL language traces in the training data, and by the frequent lack of availability of evaluation data in lower-resource related languages and dialects. To address these issues, we model phonological, morphological, and lexical distance as Bayesian noise processes to synthesize artificial languages that are controllably distant from the HRLN. We analyse PD as a function of underlying noise parameters, offering insights on model robustness to isolated and composed linguistic phenomena, and the impact of task and HRL characteristics on PD. We calculate parameter posteriors on real CRL-HRLN pair data and show that they follow computed trends of artificial languages, demonstrating the viability of our noisers. Our framework offers a cheap solution to estimating task performance on an unseen CRL given HRLN performance using its posteriors, as well as for diagnosing observed PD on a CRL in terms of its linguistic distances from its HRLN, and opens doors to principled methods of mitigating performance degradation.
- [378] arXiv:2406.13719 [pdf, html, other]
-
Title: GUI Action Narrator: Where and When Did That Action Take Place?Qinchen Wu, Difei Gao, Kevin Qinghong Lin, Zhuoyu Wu, Xiangwu Guo, Peiran Li, Weichen Zhang, Hengxu Wang, Mike Zheng ShouSubjects: Computer Vision and Pattern Recognition (cs.CV)
The advent of Multimodal LLMs has significantly enhanced image OCR recognition capabilities, making GUI automation a viable reality for increasing efficiency in digital tasks. One fundamental aspect of developing a GUI automation system is understanding primitive GUI actions. This comprehension is crucial as it enables agents to learn from user demonstrations, an essential element of automation. To rigorously evaluate such capabilities, we developed a video captioning benchmark for GUI actions, comprising 4,189 diverse video captioning samples. This task presents unique challenges compared to natural scene video captioning: 1) GUI screenshots typically contain denser information than natural scenes, and 2) events within GUIs are subtler and occur more rapidly, requiring precise attention to the appropriate time span and spatial region for accurate understanding. To address these challenges, we introduce our GUI action dataset \textbf{Act2Cap} as well as a simple yet effective framework, \textbf{GUI Narrator}, for GUI video captioning that utilizes the cursor as a visual prompt to enhance the interpretation of high-resolution screenshots. Specifically, a cursor detector is trained on our dataset, and a multimodal LLM model with mechanisms for selecting keyframes and key regions generates the captions. Experimental results indicate that even for today's most advanced multimodal models, such as GPT-4o, the task remains highly challenging. Additionally, our evaluations show that our strategy effectively enhances model performance, whether integrated into the fine-tuning of open-source models or employed as a prompting strategy in closed-source models.
- [379] arXiv:2406.13720 [pdf, html, other]
-
Title: On the Utility of Domain-Adjacent Fine-Tuned Model Ensembles for Few-shot ProblemsComments: Main paper is 8 pages, followed by limitations, references and appendixSubjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Large Language Models (LLMs) have been observed to perform well on a wide range of downstream tasks when fine-tuned on domain-specific data. However, such data may not be readily available in many applications, motivating zero-shot or few-shot approaches using domain-adjacent models. While several fine-tuned models for various tasks are available, finding an appropriate domain-adjacent model for a given task is often not straight forward. In this paper, we study DAFT-E, a framework that utilizes an Ensemble of Domain-Adjacent Fine-Tuned Foundation Models for few-shot problems. We show that for zero-shot problems, this ensembling method provides an accuracy performance close to that of the single best model. With few-shot problems, this performance improves further, at which point DEFT-E can outperform any single domain-adjacent model while requiring much less data for domain-specific fine-tuning.
- [380] arXiv:2406.13721 [pdf, html, other]
-
Title: Which One Changes More? A Novel Radial Visualization for State Change ComparisonSubjects: Human-Computer Interaction (cs.HC)
It is common to compare state changes of multiple data items and identify which data items have changed more in various applications (e.g., annual GDP growth of different countries and daily increase of new COVID-19 cases in different regions). Grouped bar charts and slope graphs can visualize both state changes and their initial and final states of multiple data items, and are thus widely used for state change comparison. But they leverage implicit bar differences or line slopes to indicate state changes, which has been proven less effective for visual comparison. Both visualizations also suffer from visual scalability issues when an increasing number of data items need to be compared. This paper fills the research gap by proposing a novel radial visualization called Intercept Graph to facilitate visual comparison of multiple state changes. It consists of inner and outer axes, and leverages the lengths of line segments intercepted by the inner axis to explicitly encode the state changes. Users can interactively adjust the inner axis to filter large changes of their interest and magnify the difference of relatively-similar state changes, enhancing its visual scalability and comparison accuracy. We extensively evaluate the Intercept Graph in comparison with baseline methods through two usage scenarios, quantitative metric evaluations, and well-designed crowdsourcing user studies with 50 participants. Our results demonstrate the usefulness and effectiveness of the Intercept Graph.
- [381] arXiv:2406.13722 [pdf, html, other]
-
Title: Channel Charting in Real-World Coordinates with Distributed MIMOComments: Submitted to a journal. arXiv admin note: substantial text overlap with arXiv:2308.14498Subjects: Information Theory (cs.IT); Signal Processing (eess.SP)
Channel charting is an emerging self-supervised method that maps channel-state information (CSI) to a low-dimensional latent space (the channel chart) that represents pseudo-positions of user equipments (UEs). While channel charts preserve local geometry, i.e., nearby UEs are nearby in the channel chart (and vice versa), the pseudo-positions are in arbitrary coordinates and global geometry is typically not preserved. In order to embed channel charts in real-world coordinates, we first propose a bilateration loss for distributed multiple-input multiple-output (D-MIMO) wireless systems in which only the access point (AP) positions are known. The idea behind this loss is to compare the received power at pairs of APs to determine whether a UE should be placed closer to one AP or the other in the channel chart. Second, we propose a line-of-sight (LoS) bounding-box loss that places the UE in a predefined LoS area of each AP that is estimated to have a LoS path to the UE. We demonstrate the efficacy of combining both of these loss functions with neural-network-based channel charting using ray-tracing-based and measurement-based channel vectors. Our approach outperforms several baselines and maintains the self-supervised nature of channel charting as it does not rely on geometrical propagation models or require ground-truth UE position information.
- [382] arXiv:2406.13724 [pdf, html, other]
-
Title: Heterogeneous Graph Neural Networks with Post-hoc Explanations for Multi-modal and Explainable Land Use InferenceSubjects: Artificial Intelligence (cs.AI)
Urban land use inference is a critically important task that aids in city planning and policy-making. Recently, the increased use of sensor and location technologies has facilitated the collection of multi-modal mobility data, offering valuable insights into daily activity patterns. Many studies have adopted advanced data-driven techniques to explore the potential of these multi-modal mobility data in land use inference. However, existing studies often process samples independently, ignoring the spatial correlations among neighbouring objects and heterogeneity among different services. Furthermore, the inherently low interpretability of complex deep learning methods poses a significant barrier in urban planning, where transparency and extrapolability are crucial for making long-term policy decisions. To overcome these challenges, we introduce an explainable framework for inferring land use that synergises heterogeneous graph neural networks (HGNs) with Explainable AI techniques, enhancing both accuracy and explainability. The empirical experiments demonstrate that the proposed HGNs significantly outperform baseline graph neural networks for all six land-use indicators, especially in terms of 'office' and 'sustenance'. As explanations, we consider feature attribution and counterfactual explanations. The analysis of feature attribution explanations shows that the symmetrical nature of the `residence' and 'work' categories predicted by the framework aligns well with the commuter's 'work' and 'recreation' activities in London. The analysis of the counterfactual explanations reveals that variations in node features and types are primarily responsible for the differences observed between the predicted land use distribution and the ideal mixed state. These analyses demonstrate that the proposed HGNs can suitably support urban stakeholders in their urban planning and policy-making.
- [383] arXiv:2406.13725 [pdf, html, other]
-
Title: Tree-Sliced Wasserstein Distance on a System of LinesComments: 33 pages, 6 figures, 2 tables, 4 algorithmsSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Sliced Wasserstein (SW) distance in Optimal Transport (OT) is widely used in various applications thanks to its statistical effectiveness and computational efficiency. On the other hand, Tree Wassenstein (TW) and Tree-sliced Wassenstein (TSW) are instances of OT for probability measures where its ground cost is a tree metric. TSW also has a low computational complexity, i.e. linear to the number of edges in the tree. Especially, TSW is identical to SW when the tree is a chain. While SW is prone to loss of topological information of input measures due to relying on one-dimensional projection, TSW is more flexible and has a higher degree of freedom by choosing a tree rather than a line to alleviate the curse of dimensionality in SW. However, for practical applications, popular tree metric sampling methods are heavily built upon given supports, which limits their capacity to adapt to new supports. In this paper, we propose the Tree-Sliced Wasserstein distance on a System of Lines (TSW-SL), which brings a connection between SW and TSW. Compared to SW and TSW, our TSW-SL benefits from the higher degree of freedom of TSW while being suitable to dynamic settings as SW. In TSW-SL, we use a variant of the Radon Transform to project measures onto a system of lines, resulting in measures on a space with a tree metric, then leverage TW to efficiently compute distances between them. We empirically verify the advantages of TSW-SL over the traditional SW by conducting a variety of experiments on gradient flows, image style transfer, and generative models.
- [384] arXiv:2406.13731 [pdf, html, other]
-
Title: Integrating Fuzzy Logic with Causal Inference: Enhancing the Pearl and Neyman-Rubin MethodologiesComments: 25 pages, 6 figures, 4 tablesSubjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Logic in Computer Science (cs.LO)
In this paper, we generalize the Pearl and Neyman-Rubin methodologies in causal inference by introducing a generalized approach that incorporates fuzzy logic. Indeed, we introduce a fuzzy causal inference approach that consider both the vagueness and imprecision inherent in data, as well as the subjective human perspective characterized by fuzzy terms such as 'high', 'medium', and 'low'. To do so, we introduce two fuzzy causal effect formulas: the Fuzzy Average Treatment Effect (FATE) and the Generalized Fuzzy Average Treatment Effect (GFATE), together with their normalized versions: NFATE and NGFATE. When dealing with a binary treatment variable, our fuzzy causal effect formulas coincide with classical Average Treatment Effect (ATE) formula, that is a well-established and popular metric in causal inference. In FATE, all values of the treatment variable are considered equally important. In contrast, GFATE takes into account the rarity and frequency of these values. We show that for linear Structural Equation Models (SEMs), the normalized versions of our formulas, NFATE and NGFATE, are equivalent to ATE. Further, we provide identifiability criteria for these formulas and show their stability with respect to minor variations in the fuzzy subsets and the probability distributions involved. This ensures the robustness of our approach in handling small perturbations in the data. Finally, we provide several experimental examples to empirically validate and demonstrate the practical application of our proposed fuzzy causal inference methods.
- [385] arXiv:2406.13733 [pdf, html, other]
-
Title: You can't handle the (dirty) truth: Data-centric insights improve pseudo-labelingComments: Published in the Journal of Data-centric Machine Learning Research (DMLR) *Seedat & Huynh contributed equallySubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Pseudo-labeling is a popular semi-supervised learning technique to leverage unlabeled data when labeled samples are scarce. The generation and selection of pseudo-labels heavily rely on labeled data. Existing approaches implicitly assume that the labeled data is gold standard and 'perfect'. However, this can be violated in reality with issues such as mislabeling or ambiguity. We address this overlooked aspect and show the importance of investigating labeled data quality to improve any pseudo-labeling method. Specifically, we introduce a novel data characterization and selection framework called DIPS to extend pseudo-labeling. We select useful labeled and pseudo-labeled samples via analysis of learning dynamics. We demonstrate the applicability and impact of DIPS for various pseudo-labeling methods across an extensive range of real-world tabular and image datasets. Additionally, DIPS improves data efficiency and reduces the performance distinctions between different pseudo-labelers. Overall, we highlight the significant benefits of a data-centric rethinking of pseudo-labeling in real-world settings.
- [386] arXiv:2406.13734 [pdf, html, other]
-
Title: A Unified Core Structure in Multiplex Networks: From Finding the Densest Subgraph to Modeling User EngagementSubjects: Social and Information Networks (cs.SI)
In many complex systems, the interactions between objects span multiple aspects. Multiplex networks are accurate paradigms to model such systems, where each edge is associated with a type. A key graph mining primitive is extracting dense subgraphs, and this has led to interesting notions such as K-cores, known as building blocks of complex networks. Despite recent attempts to extend the notion of core to multiplex networks, existing studies suffer from a subset of the following limitations: They 1) force all nodes to exhibit their high degree in the same set of relation types while in multiplex networks some connection types can be noisy for some nodes, 2) either require high computational cost or miss the complex information of multiplex networks, and 3) assume the same importance for all relation types. We introduce S-core, a novel and unifying family of dense structures in multiplex networks that uses a function S(.) to summarize the degree vector of each node. We then discuss how one can choose a proper S(.) from the data. To demonstrate the usefulness of S-cores, we focus on finding the densest subgraph as well as modeling user engagement in multiplex networks. We present a new density measure in multiplex networks and discuss its advantages over existing density measures. We show that the problem of finding the densest subgraph in multiplex networks is NP-hard and design an efficient approximation algorithm based on S-cores. Finally, we present a new mathematical model of user engagement in the presence of different relation types. Our experiments shows the efficiency and effectiveness of our algorithms and supports the proposed mathematical model of user engagement.
- [387] arXiv:2406.13735 [pdf, html, other]
-
Title: StableSemantics: A Synthetic Language-Vision Dataset of Semantic Representations in Naturalistic ImagesComments: Dataset website: this https URLSubjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Understanding the semantics of visual scenes is a fundamental challenge in Computer Vision. A key aspect of this challenge is that objects sharing similar semantic meanings or functions can exhibit striking visual differences, making accurate identification and categorization difficult. Recent advancements in text-to-image frameworks have led to models that implicitly capture natural scene statistics. These frameworks account for the visual variability of objects, as well as complex object co-occurrences and sources of noise such as diverse lighting conditions. By leveraging large-scale datasets and cross-attention conditioning, these models generate detailed and contextually rich scene representations. This capability opens new avenues for improving object recognition and scene understanding in varied and challenging environments. Our work presents StableSemantics, a dataset comprising 224 thousand human-curated prompts, processed natural language captions, over 2 million synthetic images, and 10 million attention maps corresponding to individual noun chunks. We explicitly leverage human-generated prompts that correspond to visually interesting stable diffusion generations, provide 10 generations per phrase, and extract cross-attention maps for each image. We explore the semantic distribution of generated images, examine the distribution of objects within images, and benchmark captioning and open vocabulary segmentation methods on our data. To the best of our knowledge, we are the first to release a diffusion dataset with semantic attributions. We expect our proposed dataset to catalyze advances in visual semantic understanding and provide a foundation for developing more sophisticated and effective visual models. Website: this https URL
- [388] arXiv:2406.13743 [pdf, other]
-
Title: GenAI-Bench: Evaluating and Improving Compositional Text-to-Visual GenerationBaiqi Li, Zhiqiu Lin, Deepak Pathak, Jiayao Li, Yixin Fei, Kewen Wu, Tiffany Ling, Xide Xia, Pengchuan Zhang, Graham Neubig, Deva RamananComments: We open-source our dataset, model, and code at: this https URL ; Project page: this https URL ;. arXiv admin note: substantial text overlap with arXiv:2404.01291Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Multimedia (cs.MM)
While text-to-visual models now produce photo-realistic images and videos, they struggle with compositional text prompts involving attributes, relationships, and higher-order reasoning such as logic and comparison. In this work, we conduct an extensive human study on GenAI-Bench to evaluate the performance of leading image and video generation models in various aspects of compositional text-to-visual generation. We also compare automated evaluation metrics against our collected human ratings and find that VQAScore -- a metric measuring the likelihood that a VQA model views an image as accurately depicting the prompt -- significantly outperforms previous metrics such as CLIPScore. In addition, VQAScore can improve generation in a black-box manner (without finetuning) via simply ranking a few (3 to 9) candidate images. Ranking by VQAScore is 2x to 3x more effective than other scoring methods like PickScore, HPSv2, and ImageReward at improving human alignment ratings for DALL-E 3 and Stable Diffusion, especially on compositional prompts that require advanced visio-linguistic reasoning. We will release a new GenAI-Rank benchmark with over 40,000 human ratings to evaluate scoring metrics on ranking images generated from the same prompt. Lastly, we discuss promising areas for improvement in VQAScore, such as addressing fine-grained visual details. We will release all human ratings (over 80,000) to facilitate scientific benchmarking of both generative models and automated metrics.
- [389] arXiv:2406.13744 [pdf, html, other]
-
Title: SRL-VIC: A Variable Stiffness-Based Safe Reinforcement Learning for Contact-Rich Robotic TasksComments: Accepted by IEEE RA-L,video is available at this https URLJournal-ref: IEEE Robotics and Automation Letters, vol. 9, no. 6, pp. 5631-5638, June 2024Subjects: Robotics (cs.RO)
Reinforcement learning (RL) has emerged as a promising paradigm in complex and continuous robotic tasks, however, safe exploration has been one of the main challenges, especially in contact-rich manipulation tasks in unstructured environments. Focusing on this issue, we propose SRL-VIC: a model-free safe RL framework combined with a variable impedance controller (VIC). Specifically, safety critic and recovery policy networks are pre-trained where safety critic evaluates the safety of the next action using a risk value before it is executed and the recovery policy suggests a corrective action if the risk value is high. Furthermore, the policies are updated online where the task policy not only achieves the task but also modulates the stiffness parameters to keep a safe and compliant profile. A set of experiments in contact-rich maze tasks demonstrate that our framework outperforms the baselines (without the recovery mechanism and without the VIC), yielding a good trade-off between efficient task accomplishment and safety guarantee. We show our policy trained on simulation can be deployed on a physical robot without fine-tuning, achieving successful task completion with robustness and generalization. The video is available at this https URL.
- [390] arXiv:2406.13748 [pdf, html, other]
-
Title: Every Language Counts: Learn and Unlearn in Multilingual LLMsSubjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
This paper investigates the propagation of harmful information in multilingual large language models (LLMs) and evaluates the efficacy of various unlearning methods. We demonstrate that fake information, regardless of the language it is in, once introduced into these models through training data, can spread across different languages, compromising the integrity and reliability of the generated content. Our findings reveal that standard unlearning techniques, which typically focus on English data, are insufficient in mitigating the spread of harmful content in multilingual contexts and could inadvertently reinforce harmful content across languages. We show that only by addressing harmful responses in both English and the original language of the harmful data can we effectively eliminate generations for all languages. This underscores the critical need for comprehensive unlearning strategies that consider the multilingual nature of modern LLMs to enhance their safety and reliability across diverse linguistic landscapes.
- [391] arXiv:2406.13752 [pdf, html, other]
-
Title: COAC: Cross-layer Optimization of Accelerator Configurability for Efficient CNN ProcessingComments: 14 pages,17 figures.Journal IEEE Transactions on Very Large Scale Integration (VLSI) SystemsJournal-ref: in IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 31, no. 7, pp. 945-958, July 2023Subjects: Systems and Control (eess.SY)
To achieve high accuracy, convolutional neural networks (CNNs) are increasingly growing in complexity and diversity in layer types and topologies. This makes it very challenging to efficiently deploy such networks on custom processor architectures for resource-scarce edge devices. Existing mapping exploration frameworks enable searching for the optimal execution schedules or hardware mappings of individual network layers, by optimizing each layer's spatial (dataflow parallelization) and temporal unrolling (execution order). However, these tools fail to take into account the overhead of supporting different unrolling schemes within a common hardware architecture. Using a fixed unrolling scheme across all layers is also not ideal, as this misses significant opportunities for energy and latency savings from optimizing the mapping of diverse layer types. A balanced approach assesses the right amount of mapping flexibility needed across target neural networks, while taking into account the overhead to support multiple unrollings. This paper, therefore, presents COAC, a cross-layer design space exploration and mapping framework to optimize the flexibility of neural processing architectures by balancing configurability overhead against resulting energy and latency savings for end-to-end inference. COAC does not only provide a systematical analysis of the architectural overhead in function of the supported spatial unrollings, but also builds an automated flow to find the best unrolling combination(s) for efficient end-to-end inference with limited hardware overhead. Results demonstrate that architectures with carefully optimized flexibility can achieve up to 38% EDP (energy-delay-product) savings for a set of six neural networks at the expense of a relative area increase of 9.5%.
- [392] arXiv:2406.13754 [pdf, html, other]
-
Title: Concept Drift Visualization of SVM with Shifting WindowSubjects: Machine Learning (cs.LG)
In machine learning, concept drift is an evolution of information that invalidates the current data model. It happens when the statistical properties of the input data change over time in unforeseen ways. Concept drift detection is crucial when dealing with dynamically changing data. Its visualization can bring valuable insight into the data dynamics, especially for multidimensional data, and is related to visual knowledge discovery. We propose a novel visualization model based on parallel coordinates, denoted as parallel histograms through time. Our model represents histograms of feature distributions for successive time-shifted windows. The drift is shown as variations of these histograms, obtained by connecting the means of the distribution for successive time windows. We show how these diagrams can be used to explain the decision made by the machine learning model in choosing the drift point. By isolating the drift at the edges of successive time windows, there will be none (or reduced) drift within the adjacent windows. We illustrate this concept on both synthetic and real datasets. In our experiments, we use an incremental/decremental SVM with shifting window, introduced by us in previous work. With our proposed technique, in addition to detect the presence of concept drift, we can also depict it. This information can be further used to explain the change. mental results, opening the possibility for further investigations.
- [393] arXiv:2406.13761 [pdf, html, other]
-
Title: Exponential time differencing for matrix-valued dynamical systemsSubjects: Numerical Analysis (math.NA); Machine Learning (cs.LG); Optimization and Control (math.OC)
Matrix evolution equations occur in many applications, such as dynamical Lyapunov/Sylvester systems or Riccati equations in optimization and stochastic control, machine learning or data assimilation. In many cases, their tightest stability condition is coming from a linear term. Exponential time differencing (ETD) is known to produce highly stable numerical schemes by treating the linear term in an exact fashion. In particular, for stiff problems, ETD methods are a method of choice. We propose an extension of the class of ETD algorithms to matrix-valued dynamical equations. This allows us to produce highly efficient and stable integration schemes. We show their efficiency and applicability for a variety of real-world problems, from geophysical applications to dynamical problems in machine learning.
- [394] arXiv:2406.13762 [pdf, html, other]
-
Title: Unveiling the Hidden Structure of Self-Attention via Kernel Principal Component AnalysisComments: 33 pages, 5 figures, 12 tablesSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
The remarkable success of transformers in sequence modeling tasks, spanning various applications in natural language processing and computer vision, is attributed to the critical role of self-attention. Similar to the development of most deep learning models, the construction of these attention mechanisms rely on heuristics and experience. In our work, we derive self-attention from kernel principal component analysis (kernel PCA) and show that self-attention projects its query vectors onto the principal component axes of its key matrix in a feature space. We then formulate the exact formula for the value matrix in self-attention, theoretically and empirically demonstrating that this value matrix captures the eigenvectors of the Gram matrix of the key vectors in self-attention. Leveraging our kernel PCA framework, we propose Attention with Robust Principal Components (RPC-Attention), a novel class of robust attention that is resilient to data contamination. We empirically demonstrate the advantages of RPC-Attention over softmax attention on the ImageNet-1K object classification, WikiText-103 language modeling, and ADE20K image segmentation task.
- [395] arXiv:2406.13763 [pdf, html, other]
-
Title: Through the Theory of Mind's Eye: Reading Minds with Multimodal Video Large Language ModelsSubjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Can large multimodal models have a human-like ability for emotional and social reasoning, and if so, how does it work? Recent research has discovered emergent theory-of-mind (ToM) reasoning capabilities in large language models (LLMs). LLMs can reason about people's mental states by solving various text-based ToM tasks that ask questions about the actors' ToM (e.g., human belief, desire, intention). However, human reasoning in the wild is often grounded in dynamic scenes across time. Thus, we consider videos a new medium for examining spatio-temporal ToM reasoning ability. Specifically, we ask explicit probing questions about videos with abundant social and emotional reasoning content. We develop a pipeline for multimodal LLM for ToM reasoning using video and text. We also enable explicit ToM reasoning by retrieving key frames for answering a ToM question, which reveals how multimodal LLMs reason about ToM.
- [396] arXiv:2406.13764 [pdf, html, other]
-
Title: Can LLMs Reason in the Wild with Programs?Subjects: Computation and Language (cs.CL)
Large Language Models (LLMs) have shown superior capability to solve reasoning problems with programs. While being a promising direction, most of such frameworks are trained and evaluated in settings with a prior knowledge of task requirements. However, as LLMs become more capable, it is necessary to assess their reasoning abilities in more realistic scenarios where many real-world problems are open-ended with ambiguous scope, and often require multiple formalisms to solve. To investigate this, we introduce the task of reasoning in the wild, where an LLM is tasked to solve a reasoning problem of unknown type by identifying the subproblems and their corresponding formalisms, and writing a program to solve each subproblem, guided by a tactic. We create a large tactic-guided trajectory dataset containing detailed solutions to a diverse set of reasoning problems, ranging from well-defined single-form reasoning (e.g., math, logic), to ambiguous and hybrid ones (e.g., commonsense, combined math and logic). This allows us to test various aspects of LLMs reasoning at the fine-grained level such as the selection and execution of tactics, and the tendency to take undesired shortcuts. In experiments, we highlight that existing LLMs fail significantly on problems with ambiguous and mixed scope, revealing critical limitations and overfitting issues (e.g. accuracy on GSM8K drops by at least 50\%). We further show the potential of finetuning a local LLM on the tactic-guided trajectories in achieving better performance. Project repo is available at this http URL
- [397] arXiv:2406.13768 [pdf, html, other]
-
Title: FastPersist: Accelerating Model Checkpointing in Deep LearningComments: 11 pagesSubjects: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Performance (cs.PF)
Model checkpoints are critical Deep Learning (DL) artifacts that enable fault tolerance for training and downstream applications, such as inference. However, writing checkpoints to persistent storage, and other I/O aspects of DL training, are mostly ignored by compute-focused optimization efforts for faster training of rapidly growing models and datasets. Towards addressing this imbalance, we propose FastPersist to accelerate checkpoint creation in DL training. FastPersist combines three novel techniques: (i) NVMe optimizations for faster checkpoint writes to SSDs, (ii) efficient write parallelism using the available SSDs in training environments, and (iii) overlapping checkpointing with independent training computations. Our evaluation using real world dense and sparse DL models shows that FastPersist creates checkpoints in persistent storage up to 116x faster than baseline, and enables per-iteration checkpointing with negligible overhead.
- [398] arXiv:2406.13770 [pdf, html, other]
-
Title: Elliptical AttentionComments: 38 pages, 7 figures, 12 tablesSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
Pairwise dot-product self-attention is key to the success of transformers that achieve state-of-the-art performance across a variety of applications in language and vision. This dot-product self-attention computes attention weights among the input tokens using Euclidean distance, which makes the model prone to representation collapse and vulnerable to contaminated samples. In this paper, we propose using a Mahalanobis distance metric for computing the attention weights to stretch the underlying feature space in directions of high contextual relevance. In particular, we define a hyper-ellipsoidal neighborhood around each query to increase the attention weights of the tokens lying in the contextually important directions. We term this novel class of attention Elliptical Attention. Our Elliptical Attention provides two benefits: 1) reducing representation collapse and 2) enhancing the model's robustness as the Elliptical Attention pays more attention to contextually relevant information rather than focusing on some small subset of informative features. We empirically demonstrate the advantages of Elliptical Attention over the baseline dot-product attention and state-of-the-art attention methods on various practical tasks, including object classification, image segmentation, and language modeling across different data modalities.
- [399] arXiv:2406.13777 [pdf, html, other]
-
Title: Game of LLMs: Discovering Structural Constructs in Activities using Large Language ModelsComments: 6 pages, 2 figuresSubjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Human Activity Recognition is a time-series analysis problem. A popular analysis procedure used by the community assumes an optimal window length to design recognition pipelines. However, in the scenario of smart homes, where activities are of varying duration and frequency, the assumption of a constant sized window does not hold. Additionally, previous works have shown these activities to be made up of building blocks. We focus on identifying these underlying building blocks--structural constructs, with the use of large language models. Identifying these constructs can be beneficial especially in recognizing short-duration and infrequent activities. We also propose the development of an activity recognition procedure that uses these building blocks to model activities, thus helping the downstream task of activity monitoring in smart homes.
- [400] arXiv:2406.13778 [pdf, html, other]
-
Title: Benchmarking Unsupervised Online IDS for Masquerade Attacks in CANComments: 15 pages, 9 figures, 3 tablesSubjects: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Vehicular controller area networks (CANs) are susceptible to masquerade attacks by malicious adversaries. In masquerade attacks, adversaries silence a targeted ID and then send malicious frames with forged content at the expected timing of benign frames. As masquerade attacks could seriously harm vehicle functionality and are the stealthiest attacks to detect in CAN, recent work has devoted attention to compare frameworks for detecting masquerade attacks in CAN. However, most existing works report offline evaluations using CAN logs already collected using simulations that do not comply with domain's real-time constraints. Here we contribute to advance the state of the art by introducing a benchmark study of four different non-deep learning (DL)-based unsupervised online intrusion detection systems (IDS) for masquerade attacks in CAN. Our approach differs from existing benchmarks in that we analyze the effect of controlling streaming data conditions in a sliding window setting. In doing so, we use realistic masquerade attacks being replayed from the ROAD dataset. We show that although benchmarked IDS are not effective at detecting every attack type, the method that relies on detecting changes at the hierarchical structure of clusters of time series produces the best results at the expense of higher computational overhead. We discuss limitations, open challenges, and how the benchmarked methods can be used for practical unsupervised online CAN IDS for masquerade attacks.
- [401] arXiv:2406.13779 [pdf, html, other]
-
Title: FoRAG: Factuality-optimized Retrieval Augmented Generation for Web-enhanced Long-form Question AnsweringJournal-ref: In KDD 2024, Barcelona, Spain. 30 pagesSubjects: Computation and Language (cs.CL)
Retrieval Augmented Generation (RAG) has become prevalent in question-answering (QA) tasks due to its ability of utilizing search engine to enhance the quality of long-form question-answering (LFQA). Despite the emergence of various open source methods and web-enhanced commercial systems such as Bing Chat, two critical problems remain unsolved, i.e., the lack of factuality and clear logic in the generated long-form answers. In this paper, we remedy these issues via a systematic study on answer generation in web-enhanced LFQA. Specifically, we first propose a novel outline-enhanced generator to achieve clear logic in the generation of multifaceted answers and construct two datasets accordingly. Then we propose a factuality optimization method based on a carefully designed doubly fine-grained RLHF framework, which contains automatic evaluation and reward modeling in different levels of granularity. Our generic framework comprises conventional fine-grained RLHF methods as special cases. Extensive experiments verify the superiority of our proposed \textit{Factuality-optimized RAG (FoRAG)} method on both English and Chinese benchmarks. In particular, when applying our method to Llama2-7B-chat, the derived model FoRAG-L-7B outperforms WebGPT-175B in terms of three commonly used metrics (i.e., coherence, helpfulness, and factuality), while the number of parameters is much smaller (only 1/24 of that of WebGPT-175B). Our datasets and models are made publicly available for better reproducibility: this https URL.
- [402] arXiv:2406.13781 [pdf, html, other]
-
Title: A Primal-Dual Framework for Transformers and Neural NetworksComments: Accepted to ICLR 2023, 26 pages, 4 figures, 14 tablesSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
Self-attention is key to the remarkable success of transformers in sequence modeling tasks including many applications in natural language processing and computer vision. Like neural network layers, these attention mechanisms are often developed by heuristics and experience. To provide a principled framework for constructing attention layers in transformers, we show that the self-attention corresponds to the support vector expansion derived from a support vector regression problem, whose primal formulation has the form of a neural network layer. Using our framework, we derive popular attention layers used in practice and propose two new attentions: 1) the Batch Normalized Attention (Attention-BN) derived from the batch normalization layer and 2) the Attention with Scaled Head (Attention-SH) derived from using less training data to fit the SVR model. We empirically demonstrate the advantages of the Attention-BN and Attention-SH in reducing head redundancy, increasing the model's accuracy, and improving the model's efficiency in a variety of practical applications including image and time-series classification.
- [403] arXiv:2406.13787 [pdf, html, other]
-
Title: LIT: Large Language Model Driven Intention Tracking for Proactive Human-Robot Collaboration -- A Robot Sous-Chef ApplicationComments: Spotlight Presentation at the 3rd Workshop on Computer Vision in the Wild at CVPR 2024. Also accepted by the 5th Annual Embodied AI Workshop at CVPR 2024Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Large Language Models (LLM) and Vision Language Models (VLM) enable robots to ground natural language prompts into control actions to achieve tasks in an open world. However, when applied to a long-horizon collaborative task, this formulation results in excessive prompting for initiating or clarifying robot actions at every step of the task. We propose Language-driven Intention Tracking (LIT), leveraging LLMs and VLMs to model the human user's long-term behavior and to predict the next human intention to guide the robot for proactive collaboration. We demonstrate smooth coordination between a LIT-based collaborative robot and the human user in collaborative cooking tasks.
- [404] arXiv:2406.13791 [pdf, html, other]
-
Title: IoT-Based Preventive Mental Health Using Knowledge Graphs and Standards for Better Well-BeingComments: 20 pagesSubjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY); Machine Learning (cs.LG)
Sustainable Development Goals (SDGs) give the UN a road map for development with Agenda 2030 as a target. SDG3 "Good Health and Well-Being" ensures healthy lives and promotes well-being for all ages. Digital technologies can support SDG3. Burnout and even depression could be reduced by encouraging better preventive health. Due to the lack of patient knowledge and focus to take care of their health, it is necessary to help patients before it is too late. New trends such as positive psychology and mindfulness are highly encouraged in the USA. Digital Twin (DT) can help with the continuous monitoring of emotion using physiological signals (e.g., collected via wearables). Digital twins facilitate monitoring and provide constant health insight to improve quality of life and well-being with better personalization. Healthcare DT challenges are standardizing data formats, communication protocols, and data exchange mechanisms. To achieve those data integration and knowledge challenges, we designed the Mental Health Knowledge Graph (ontology and dataset) to boost mental health. The Knowledge Graph (KG) acquires knowledge from ontology-based mental health projects classified within the LOV4IoT ontology catalog (Emotion, Depression, and Mental Health). Furthermore, the KG is mapped to standards (e.g., ontologies) when possible. Standards from ETSI SmartM2M, ITU/WHO, ISO, W3C, NIST, and IEEE are relevant to mental health.
- [405] arXiv:2406.13793 [pdf, html, other]
-
Title: Exploring the Optimal Time Window for Predicting Cognitive Load Using Physiological Sensor DataComments: Presented at PhysioCHI: Towards Best Practices for Integrating Physiological Signals in HCI, May 11, 2024, Honolulu, HI, USASubjects: Human-Computer Interaction (cs.HC)
Learning analytics has begun to use physiological signals because these have been linked with learners' cognitive and affective states. These signals, when interpreted through machine learning techniques, offer a nuanced understanding of the temporal dynamics of student learning experiences and processes. However, there is a lack of clear guidance on the optimal time window to use for analyzing physiological signals within predictive models. We conducted an empirical investigation of different time windows (ranging from 60 to 210 seconds) when analysing multichannel physiological sensor data for predicting cognitive load. Our results demonstrate a preference for longer time windows, with optimal window length typically exceeding 90 seconds. These findings challenge the conventional focus on immediate physiological responses, suggesting that a broader temporal scope could provide a more comprehensive understanding of cognitive processes. In addition, the variation in which time windows best supported prediction across classifiers underscores the complexity of integrating physiological measures. Our findings provide new insights for developing educational technologies that more accurately reflect and respond to the dynamic nature of learner cognitive load in complex learning environments.
- [406] arXiv:2406.13794 [pdf, html, other]
-
Title: Adaptive Curves for Optimally Efficient Market MakingSubjects: Systems and Control (eess.SY); Computational Engineering, Finance, and Science (cs.CE); Trading and Market Microstructure (q-fin.TR)
Automated Market Makers (AMMs) are essential in Decentralized Finance (DeFi) as they match liquidity supply with demand. They function through liquidity providers (LPs) who deposit assets into liquidity pools. However, the asset trading prices in these pools often trail behind those in more dynamic, centralized exchanges, leading to potential arbitrage losses for LPs. This issue is tackled by adapting market maker bonding curves to trader behavior, based on the classical market microstructure model of Glosten and Milgrom. Our approach ensures a zero-profit condition for the market maker's prices. We derive the differential equation that an optimal adaptive curve should follow to minimize arbitrage losses while remaining competitive. Solutions to this optimality equation are obtained for standard Gaussian and Lognormal price models using Kalman filtering. A key feature of our method is its ability to estimate the external market price without relying on price or loss oracles. We also provide an equivalent differential equation for the implied dynamics of canonical static bonding curves and establish conditions for their optimality. Our algorithms demonstrate robustness to changing market conditions and adversarial perturbations, and we offer an on-chain implementation using Uniswap v4 alongside off-chain AI co-processors.
- [407] arXiv:2406.13796 [pdf, html, other]
-
Title: NeRF-Feat: 6D Object Pose Estimation using Feature RenderingComments: 3DV 2024Journal-ref: 3DV 2024Subjects: Computer Vision and Pattern Recognition (cs.CV)
Object Pose Estimation is a crucial component in robotic grasping and augmented reality. Learning based approaches typically require training data from a highly accurate CAD model or labeled training data acquired using a complex setup. We address this by learning to estimate pose from weakly labeled data without a known CAD model. We propose to use a NeRF to learn object shape implicitly which is later used to learn view-invariant features in conjunction with CNN using a contrastive loss. While NeRF helps in learning features that are view-consistent, CNN ensures that the learned features respect symmetry. During inference, CNN is used to predict view-invariant features which can be used to establish correspondences with the implicit 3d model in NeRF. The correspondences are then used to estimate the pose in the reference frame of NeRF. Our approach can also handle symmetric objects unlike other approaches using a similar training setup. Specifically, we learn viewpoint invariant, discriminative features using NeRF which are later used for pose estimation. We evaluated our approach on LM, LM-Occlusion, and T-Less dataset and achieved benchmark accuracy despite using weakly labeled data.
- [408] arXiv:2406.13797 [pdf, html, other]
-
Title: Quantum automata and languages of finite indexComments: arXiv admin note: text overlap with arXiv:1303.2967Subjects: Formal Languages and Automata Theory (cs.FL)
This paper is a continuation of a previous study on the so-called measure once finite quantum automata model introduced by Moore and Crutchfield in 2000. We investigate conditions assuring that, given a language recognized by such a device and a language generated by a context-free grammar of finite index or by a matrix context-free grammar, it is recursively decidable whether or not they have a nonempty intersection.
- [409] arXiv:2406.13800 [pdf, html, other]
-
Title: A Graph Model and a Layout Algorithm for Knitting PatternsComments: 13 pages, 6 tables, 7 figuresSubjects: Human-Computer Interaction (cs.HC)
Knitting, an ancient fiber art, creates a structured fabric consisting of loops or stitches. Publishing hand knitting patterns involves lengthy testing periods and numerous knitters. Modeling knitting patterns with graphs can help expedite error detection and pattern validation. In this paper, we describe how to model simple knitting patterns as planar graphs. We then design, implement, and evaluate a layout algorithm to visualize knitting patterns. Knitting patterns correspond to graphs with pre-specified edge lengths (e.g., uniform lengths, two lengths, etc.). This yields a natural graph layout optimization problem: realize a planar graph with pre-specified edge lengths, while ensuring there are no edge crossings. We quantitatively evaluate our algorithm using real knitting patterns of various sizes against three others; one created for knitting patterns, one that maintains planarity and optimizes edge lengths, and a popular force-directed algorithm.
- [410] arXiv:2406.13803 [pdf, html, other]
-
Title: Semantic Structure-Mapping in LLM and Human Analogical ReasoningSubjects: Computation and Language (cs.CL)
Analogical reasoning is considered core to human learning and cognition. Recent studies have compared the analogical reasoning abilities of human subjects and Large Language Models (LLMs) on abstract symbol manipulation tasks, such as letter string analogies. However, these studies largely neglect analogical reasoning over semantically meaningful symbols, such as natural language words. This ability to draw analogies that link language to non-linguistic domains, which we term semantic structure-mapping, is thought to play a crucial role in language acquisition and broader cognitive development. We test human subjects and LLMs on analogical reasoning tasks that require the transfer of semantic structure and content from one domain to another. Advanced LLMs match human performance across many task variations. However, humans and LLMs respond differently to certain task variations and semantic distractors. Overall, our data suggest that LLMs are approaching human-level performance on these important cognitive tasks, but are not yet entirely human like.
- [411] arXiv:2406.13805 [pdf, html, other]
-
Title: WikiContradict: A Benchmark for Evaluating LLMs on Real-World Knowledge Conflicts from WikipediaYufang Hou, Alessandra Pascale, Javier Carnerero-Cano, Tigran Tchrakian, Radu Marinescu, Elizabeth Daly, Inkit Padhi, Prasanna SattigeriSubjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Retrieval-augmented generation (RAG) has emerged as a promising solution to mitigate the limitations of large language models (LLMs), such as hallucinations and outdated information. However, it remains unclear how LLMs handle knowledge conflicts arising from different augmented retrieved passages, especially when these passages originate from the same source and have equal trustworthiness. In this work, we conduct a comprehensive evaluation of LLM-generated answers to questions that have varying answers based on contradictory passages from Wikipedia, a dataset widely regarded as a high-quality pre-training resource for most LLMs. Specifically, we introduce WikiContradict, a benchmark consisting of 253 high-quality, human-annotated instances designed to assess LLM performance when augmented with retrieved passages containing real-world knowledge conflicts. We benchmark a diverse range of both closed and open-source LLMs under different QA scenarios, including RAG with a single passage, and RAG with 2 contradictory passages. Through rigorous human evaluations on a subset of WikiContradict instances involving 5 LLMs and over 3,500 judgements, we shed light on the behaviour and limitations of these models. For instance, when provided with two passages containing contradictory facts, all models struggle to generate answers that accurately reflect the conflicting nature of the context, especially for implicit conflicts requiring reasoning. Since human evaluation is costly, we also introduce an automated model that estimates LLM performance using a strong open-source language model, achieving an F-score of 0.8. Using this automated metric, we evaluate more than 1,500 answers from seven LLMs across all WikiContradict instances. To facilitate future work, we release WikiContradict on: this https URL.
- [412] arXiv:2406.13807 [pdf, html, other]
-
Title: AlanaVLM: A Multimodal Embodied AI Foundation Model for Egocentric Video UnderstandingAlessandro Suglia, Claudio Greco, Katie Baker, Jose L. Part, Ioannis Papaionnou, Arash Eshghi, Ioannis Konstas, Oliver LemonComments: Code available this https URLSubjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
AI personal assistants deployed via robots or wearables require embodied understanding to collaborate with humans effectively. However, current Vision-Language Models (VLMs) primarily focus on third-person view videos, neglecting the richness of egocentric perceptual experience. To address this gap, we propose three key contributions. First, we introduce the Egocentric Video Understanding Dataset (EVUD) for training VLMs on video captioning and question answering tasks specific to egocentric videos. Second, we present AlanaVLM, a 7B parameter VLM trained using parameter-efficient methods on EVUD. Finally, we evaluate AlanaVLM's capabilities on OpenEQA, a challenging benchmark for embodied video question answering. Our model achieves state-of-the-art performance, outperforming open-source models including strong Socratic models using GPT-4 as a planner by 3.6%. Additionally, we outperform Claude 3 and Gemini Pro Vision 1.0 and showcase competitive results compared to Gemini Pro 1.5 and GPT-4V, even surpassing the latter in spatial reasoning. This research paves the way for building efficient VLMs that can be deployed in robots or wearables, leveraging embodied video understanding to collaborate seamlessly with humans in everyday tasks, contributing to the next generation of Embodied AI
- [413] arXiv:2406.13808 [pdf, html, other]
-
Title: Can Low-Rank Knowledge Distillation in LLMs be Useful for Microelectronic Reasoning?Comments: 4 pages, 2 figures, 2 tables, The First IEEE International Workshop on LLM-Aided Design (LAD'24)Subjects: Machine Learning (cs.LG)
In this work, we present empirical results regarding the feasibility of using offline large language models (LLMs) in the context of electronic design automation (EDA). The goal is to investigate and evaluate a contemporary language model's (Llama-2-7B) ability to function as a microelectronic Q & A expert as well as its reasoning, and generation capabilities in solving microelectronic-related problems. Llama-2-7B was tested across a variety of adaptation methods, including introducing a novel low-rank knowledge distillation (LoRA-KD) scheme. Our experiments produce both qualitative and quantitative results.
- [414] arXiv:2406.13809 [pdf, other]
-
Title: Towards Holistic Language-video Representation: the language model-enhanced MSR-Video to Text DatasetSubjects: Multimedia (cs.MM); Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)
A more robust and holistic language-video representation is the key to pushing video understanding forward. Despite the improvement in training strategies, the quality of the language-video dataset is less attention to. The current plain and simple text descriptions and the visual-only focus for the language-video tasks result in a limited capacity in real-world natural language video retrieval tasks where queries are much more complex. This paper introduces a method to automatically enhance video-language datasets, making them more modality and context-aware for more sophisticated representation learning needs, hence helping all downstream tasks. Our multifaceted video captioning method captures entities, actions, speech transcripts, aesthetics, and emotional cues, providing detailed and correlating information from the text side to the video side for training. We also develop an agent-like strategy using language models to generate high-quality, factual textual descriptions, reducing human intervention and enabling scalability. The method's effectiveness in improving language-video representation is evaluated through text-video retrieval using the MSR-VTT dataset and several multi-modal retrieval models.
- [415] arXiv:2406.13813 [pdf, other]
-
Title: The Efficacy of Conversational Artificial Intelligence in Rectifying the Theory of Mind and Autonomy Biases: Comparative AnalysisComments: 28 pages, 5 tables, 6 figuresSubjects: Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
The study evaluates the efficacy of Conversational Artificial Intelligence (CAI) in rectifying cognitive biases and recognizing affect in human-AI interactions, which is crucial for digital mental health interventions. Cognitive biases (systematic deviations from normative thinking) affect mental health, intensifying conditions like depression and anxiety. Therapeutic chatbots can make cognitive-behavioral therapy (CBT) more accessible and affordable, offering scalable and immediate support. The research employs a structured methodology with clinical-based virtual case scenarios simulating typical user-bot interactions. Performance and affect recognition were assessed across two categories of cognitive biases: theory of mind biases (anthropomorphization of AI, overtrust in AI, attribution to AI) and autonomy biases (illusion of control, fundamental attribution error, just-world hypothesis). A qualitative feedback mechanism was used with an ordinal scale to quantify responses based on accuracy, therapeutic quality, and adherence to CBT principles. Therapeutic bots (Wysa, Youper) and general-use LLMs (GTP 3.5, GTP 4, Gemini Pro) were evaluated through scripted interactions, double-reviewed by cognitive scientists and a clinical psychologist. Statistical analysis showed therapeutic bots were consistently outperformed by non-therapeutic bots in bias rectification and in 4 out of 6 biases in affect recognition. The data suggests that non-therapeutic chatbots are more effective in addressing some cognitive biases.
- [416] arXiv:2406.13817 [pdf, html, other]
-
Title: SkyGrid: Energy-Flow Optimization at Harmonized Aerial IntersectionsComments: 8 pages, 12 figures - Submitted to IEEE VTC Fall 2024 - Under ReviewSubjects: Systems and Control (eess.SY)
The rapid evolution of urban air mobility (UAM) is reshaping the future of transportation by integrating aerial vehicles into urban transit systems. The design of aerial intersections plays a critical role in the phased development of UAM systems to ensure safe and efficient operations in air corridors. This work adapts the concept of rhythmic control of connected and automated vehicles (CAVs) at unsignalized intersections to address complex traffic control problems. This control framework assigns UAM vehicles to different movement groups and significantly reduces the computation of routing strategies to avoid conflicts. In contrast to ground traffic, the objective is to balance three measures: minimizing energy utilization, maximizing intersection flow (throughput), and maintaining safety distances. This optimization method dynamically directs traffic with various demands, considering path assignment distributions and segment-level trajectory coefficients for straight and curved paths as control variables. To the best of our knowledge, this is the first work to consider a multi-objective optimization approach for unsignalized intersection control in the air and to propose such optimization in a rhythmic control setting with time arrival and UAM operational constraints. A sensitivity analysis with respect to inter-platoon safety and straight/left demand balance demonstrates the effectiveness of our method in handling traffic under various scenarios.
- [417] arXiv:2406.13820 [pdf, html, other]
-
Title: Framing Social Movements on Social Media: Unpacking Diagnostic, Prognostic, and Motivational StrategiesComments: Published in ICWSM Special Issue of the Journal of Quantitative Description: Digital MediaJournal-ref: Journal of Quantitative Description: Digital Media (2024)Subjects: Computation and Language (cs.CL)
Social media enables activists to directly communicate with the public and provides a space for movement leaders, participants, bystanders, and opponents to collectively construct and contest narratives. Focusing on Twitter messages from social movements surrounding three issues in 2018-2019 (guns, immigration, and LGBTQ rights), we create a codebook, annotated dataset, and computational models to detect diagnostic (problem identification and attribution), prognostic (proposed solutions and tactics), and motivational (calls to action) framing strategies. We conduct an in-depth unsupervised linguistic analysis of each framing strategy, and uncover cross-movement similarities in associations between framing and linguistic features such as pronouns and deontic modal verbs. Finally, we compare framing strategies across issues and other social, cultural, and interactional contexts. For example, we show that diagnostic framing is more common in replies than original broadcast posts, and that social movement organizations focus much more on prognostic and motivational framing than journalists and ordinary citizens.
- [418] arXiv:2406.13824 [pdf, html, other]
-
Title: Symmetrically Fair Allocations of Indivisible GoodsSubjects: Computer Science and Game Theory (cs.GT); Optimization and Control (math.OC)
We consider allocating indivisible goods with provable fairness guarantees that are satisfied regardless of which bundle of items each agent receives. Symmetrical allocations of this type are known to exist for divisible resources, such as consensus splitting of a cake into parts, each having equal value for all agents, ensuring that in any allocation of the cake slices, no agent would envy another. For indivisible goods, one analogous concept relaxes envy freeness to guarantee the existence of an allocation in which any bundle is worth as much as any other, up to the value of a bounded number of items from the other bundle. Previous work has studied the number of items that need to be removed. In this paper, we improve upon these bounds for the specific setting in which the number of bundles equals the number of agents.
Concretely, we develop the theory of symmetrically envy free up to one good, or symEF1, allocations. We prove that a symEF1 allocation exists if the vertices of a related graph can be partitioned (colored) into as many independent sets as there are agents. This sufficient condition always holds for two agents, and for agents that have identical, disjoint, or binary valuations. We further prove conditions under which exponentially-many distinct symEF1 allocations exist. Finally, we perform computational experiments to study the incidence of symEF1 allocations as a function of the number of agents and items when valuations are drawn uniformly at random. - [419] arXiv:2406.13827 [pdf, html, other]
-
Title: Fine-Tuning BERTs for Definition Extraction from Mathematical TextSubjects: Computation and Language (cs.CL)
In this paper, we fine-tuned three pre-trained BERT models on the task of "definition extraction" from mathematical English written in LaTeX. This is presented as a binary classification problem, where either a sentence contains a definition of a mathematical term or it does not. We used two original data sets, "Chicago" and "TAC," to fine-tune and test these models. We also tested on WFMALL, a dataset presented by Vanetik and Litvak in 2021 and compared the performance of our models to theirs. We found that a high-performance Sentence-BERT transformer model performed best based on overall accuracy, recall, and precision metrics, achieving comparable results to the earlier models with less computational effort.
- [420] arXiv:2406.13828 [pdf, html, other]
-
Title: Neuro-symbolic Training for Reasoning over Spatial LanguageSubjects: Computation and Language (cs.CL)
Recent research shows that more data and larger models can provide more accurate solutions to natural language problems requiring reasoning. However, models can easily fail to provide solutions in unobserved complex input compositions due to not achieving the level of abstraction required for generalizability. To alleviate this issue, we propose training the language models with neuro-symbolic techniques that can exploit the logical rules of reasoning as constraints and provide additional supervision sources to the model. Training models to adhere to the regulations of reasoning pushes them to make more effective abstractions needed for generalizability and transfer learning. We focus on a challenging problem of spatial reasoning over text. Our results on various benchmarks using multiple language models confirm our hypothesis of effective domain transfer based on neuro-symbolic training.
- [421] arXiv:2406.13829 [pdf, html, other]
-
Title: Group-Control Motion Planning Framework for Microrobot Swarms in a Global FieldSubjects: Robotics (cs.RO)
This paper investigates how group-control can be effectively used for motion planning for microrobot swarms in a global field. We prove that Small-Time Local Controllability (STLC) in robot positions is achievable through group-control, with the minimum number of groups required for STLC being $\log_2(n + 2) + 1$ for $n$ robots. We then discuss the complexity trade-offs between control and motion planning. We show how motion planning can be simplified if appropriate primitives can be achieved through more complex control actions. We identify motion planning problems that balance the number of robot groups and motion primitives with planning complexity. Various instantiations of these motion planning problems are explored, with simulations to demonstrate the effectiveness of group-control.
- [422] arXiv:2406.13831 [pdf, html, other]
-
Title: A Comprehensive Overview of GPU Accelerated DatabasesSubjects: Databases (cs.DB)
Over the past decade, the landscape of data analytics has seen a notable shift towards heterogeneous architectures, particularly the integration of GPUs to enhance overall performance. In the realm of in-memory analytics, which often grapples with memory bandwidth constraints, the adoption of GPUs has proven advantageous, thanks to their superior bandwidth capabilities. The parallel processing prowess of GPUs stands out, providing exceptional efficiency for data-intensive workloads and outpacing traditional CPUs in terms of data processing speed. While GPU databases capitalize on these strengths, there remains a scarcity of comparative studies across different GPU systems. In light of this emerging interest in GPU databases for data analytics, this paper proposes a survey encompassing multiple GPU database systems. The focus will be on elucidating the underlying mechanisms employed to deliver results and key performance metrics, utilizing benchmarks such as SSB and TPCH. This undertaking aims to shed light on new avenues for research within the realm of GPU databases.
- [423] arXiv:2406.13834 [pdf, html, other]
-
Title: Optimizing Wireless Discontinuous Reception via MAC Signaling LearningComments: 9 pages, 10 figures, submitted to IEEE TMLCNSubjects: Information Theory (cs.IT); Machine Learning (cs.LG)
We present a Reinforcement Learning (RL) approach to the problem of controlling the Discontinuous Reception (DRX) policy from a Base Transceiver Station (BTS) in a cellular network. We do so by means of optimally timing the transmission of fast Layer-2 signaling messages (a.k.a. Medium Access Layer (MAC) Control Elements (CEs) as specified in 5G New Radio). Unlike more conventional approaches to DRX optimization, which rely on fine-tuning the values of DRX timers, we assess the gains that can be obtained solely by means of this MAC CE signalling. For the simulation part, we concentrate on traffic types typically encountered in Extended Reality (XR) applications, where the need for battery drain minimization and overheating mitigation are particularly pressing. Both 3GPP 5G New Radio (5G NR) compliant and non-compliant ("beyond 5G") MAC CEs are considered. Our simulation results show that our proposed technique strikes an improved trade-off between latency and energy savings as compared to conventional timer-based approaches that are characteristic of most current implementations. Specifically, our RL-based policy can nearly halve the active time for a single User Equipment (UE) with respect to a naïve MAC CE transmission policy, and still achieve near 20% active time reduction for 9 simultaneously served UEs.
- [424] arXiv:2406.13835 [pdf, html, other]
-
Title: Bundling in Oligopoly: Revenue Maximization with Single-Item CompetitorsComments: Accepted to EC 2024Subjects: Computer Science and Game Theory (cs.GT); Theoretical Economics (econ.TH)
We consider a principal seller with $m$ heterogeneous products to sell to an additive buyer over independent items. The principal can offer an arbitrary menu of product bundles, but faces competition from smaller and more agile single-item sellers. The single-item sellers choose their prices after the principal commits to a menu, potentially under-cutting the principal's offerings. We explore to what extent the principal can leverage the ability to bundle product together to extract revenue.
Any choice of menu by the principal induces an oligopoly pricing game between the single-item sellers, which may have multiple equilibria. When there is only a single item this model reduces to Bertrand competition, for which the principal's revenue is $0$ at any equilibrium, so we assume that no single item's value is too dominant. We establish an upper bound on the principal's optimal revenue at every equilibrium: the expected welfare after truncating each item's value to its revenue-maximizing price. Under a technical condition on the value distributions -- that the monopolist's revenue is sufficiently sensitive to price -- we show that the principal seller can simply price the grand-bundle and ensure (in any equilibrium) a constant approximation to this bound (and hence to the optimal revenue). We also show that for some value distributions violating our conditions, grand-bundle pricing does not yield a constant approximation to the optimal revenue in any equilibrium. - [425] arXiv:2406.13840 [pdf, html, other]
-
Title: StackRAG Agent: Improving Developer Answers with Retrieval-Augmented GenerationSubjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Developers spend much time finding information that is relevant to their questions. Stack Overflow has been the leading resource, and with the advent of Large Language Models (LLMs), generative models such as ChatGPT are used frequently. However, there is a catch in using each one separately. Searching for answers is time-consuming and tedious, as shown by the many tools developed by researchers to address this issue. On the other, using LLMs is not reliable, as they might produce irrelevant or unreliable answers (i.e., hallucination). In this work, we present StackRAG, a retrieval-augmented Multiagent generation tool based on LLMs that combines the two worlds: aggregating the knowledge from SO to enhance the reliability of the generated answers. Initial evaluations show that the generated answers are correct, accurate, relevant, and useful.
- [426] arXiv:2406.13842 [pdf, html, other]
-
Title: Joint vs Sequential Speaker-Role Detection and Automatic Speech Recognition for Air-traffic ControlComments: Accepted at Interspeech 2024Subjects: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Utilizing air-traffic control (ATC) data for downstream natural-language processing tasks requires preprocessing steps. Key steps are the transcription of the data via automatic speech recognition (ASR) and speaker diarization, respectively speaker role detection (SRD) to divide the transcripts into pilot and air-traffic controller (ATCO) transcripts. While traditional approaches take on these tasks separately, we propose a transformer-based joint ASR-SRD system that solves both tasks jointly while relying on a standard ASR architecture. We compare this joint system against two cascaded approaches for ASR and SRD on multiple ATC datasets. Our study shows in which cases our joint system can outperform the two traditional approaches and in which cases the other architectures are preferable. We additionally evaluate how acoustic and lexical differences influence all architectures and show how to overcome them for our joint architecture.
- [427] arXiv:2406.13843 [pdf, html, other]
-
Title: Generative AI Misuse: A Taxonomy of Tactics and Insights from Real-World DataSubjects: Artificial Intelligence (cs.AI)
Generative, multimodal artificial intelligence (GenAI) offers transformative potential across industries, but its misuse poses significant risks. Prior research has shed light on the potential of advanced AI systems to be exploited for malicious purposes. However, we still lack a concrete understanding of how GenAI models are specifically exploited or abused in practice, including the tactics employed to inflict harm. In this paper, we present a taxonomy of GenAI misuse tactics, informed by existing academic literature and a qualitative analysis of approximately 200 observed incidents of misuse reported between January 2023 and March 2024. Through this analysis, we illuminate key and novel patterns in misuse during this time period, including potential motivations, strategies, and how attackers leverage and abuse system capabilities across modalities (e.g. image, text, audio, video) in the wild.
- [428] arXiv:2406.13844 [pdf, html, other]
-
Title: MAMA-MIA: A Large-Scale Multi-Center Breast Cancer DCE-MRI Benchmark Dataset with Expert SegmentationsLidia Garrucho, Claire-Anne Reidel, Kaisar Kushibar, Smriti Joshi, Richard Osuala, Apostolia Tsirikoglou, Maciej Bobowicz, Javier del Riego, Alessandro Catanese, Katarzyna Gwoździewicz, Maria-Laura Cosaka, Pasant M. Abo-Elhoda, Sara W. Tantawy, Shorouq S. Sakrana, Norhan O. Shawky-Abdelfatah, Amr Muhammad Abdo-Salem, Androniki Kozana, Eugen Divjak, Gordana Ivanac, Katerina Nikiforaki, Michail E. Klontzas, Rosa García-Dosdá, Meltem Gulsun-Akpinar, Oğuz Lafcı, Ritse Mann, Carlos Martín-Isla, Fred Prior, Kostas Marias, Martijn P.A. Starmans, Fredrik Strand, Oliver Díaz, Laura Igual, Karim LekadirComments: 15 paes, 7 figures, 3 tablesSubjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Databases (cs.DB)
Current research in breast cancer Magnetic Resonance Imaging (MRI), especially with Artificial Intelligence (AI), faces challenges due to the lack of expert segmentations. To address this, we introduce the MAMA-MIA dataset, comprising 1506 multi-center dynamic contrast-enhanced MRI cases with expert segmentations of primary tumors and non-mass enhancement areas. These cases were sourced from four publicly available collections in The Cancer Imaging Archive (TCIA). Initially, we trained a deep learning model to automatically segment the cases, generating preliminary segmentations that significantly reduced expert segmentation time. Sixteen experts, averaging 9 years of experience in breast cancer, then corrected these segmentations, resulting in the final expert segmentations. Additionally, two radiologists conducted a visual inspection of the automatic segmentations to support future quality control studies. Alongside the expert segmentations, we provide 49 harmonized demographic and clinical variables and the pretrained weights of the well-known nnUNet architecture trained using the DCE-MRI full-images and expert segmentations. This dataset aims to accelerate the development and benchmarking of deep learning models and foster innovation in breast cancer diagnostics and treatment planning.
- [429] arXiv:2406.13846 [pdf, html, other]
-
Title: Text Serialization and Their Relationship with the Conventional Paradigms of Tabular Machine LearningComments: Accepted into the ICML AI4Science WorkshopSubjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Recent research has explored how Language Models (LMs) can be used for feature representation and prediction in tabular machine learning tasks. This involves employing text serialization and supervised fine-tuning (SFT) techniques. Despite the simplicity of these techniques, significant gaps remain in our understanding of the applicability and reliability of LMs in this context. Our study assesses how emerging LM technologies compare with traditional paradigms in tabular machine learning and evaluates the feasibility of adopting similar approaches with these advanced technologies. At the data level, we investigate various methods of data representation and curation of serialized tabular data, exploring their impact on prediction performance. At the classification level, we examine whether text serialization combined with LMs enhances performance on tabular datasets (e.g. class imbalance, distribution shift, biases, and high dimensionality), and assess whether this method represents a state-of-the-art (SOTA) approach for addressing tabular machine learning challenges. Our findings reveal current pre-trained models should not replace conventional approaches.
- [430] arXiv:2406.13847 [pdf, html, other]
-
Title: Locating and measuring marine aquaculture production from space: a computer vision approach in the French MediterraneanSubjects: Computer Vision and Pattern Recognition (cs.CV)
Aquaculture production -- the cultivation of aquatic plants and animals -- has grown rapidly since the 1990s, but sparse, self-reported and aggregate production data limits the effective understanding and monitoring of the industry's trends and potential risks. Building on a manual survey of aquaculture production from remote sensing imagery, we train a computer vision model to identify marine aquaculture cages from aerial and satellite imagery, and generate a spatially explicit dataset of finfish production locations in the French Mediterranean from 2000-2021 that includes 4,010 cages (69m2 average cage area). We demonstrate the value of our method as an easily adaptable, cost-effective approach that can improve the speed and reliability of aquaculture surveys, and enables downstream analyses relevant to researchers and regulators. We illustrate its use to compute independent estimates of production, and develop a flexible framework to quantify uncertainty in these estimates. Overall, our study presents an efficient, scalable and highly adaptable method for monitoring aquaculture production from remote sensing imagery.
- [431] arXiv:2406.13849 [pdf, html, other]
-
Title: Comparison of Nested Geometry Treatments within GPU-Based Monte Carlo Neutron Transport Simulations of Fission ReactorsSubjects: Distributed, Parallel, and Cluster Computing (cs.DC); Computational Engineering, Finance, and Science (cs.CE); Computational Geometry (cs.CG)
Monte Carlo (MC) neutron transport provides detailed estimates of radiological quantities within fission reactors. This method involves tracking individual neutrons through a computational geometry. CPU-based MC codes use multiple polymorphic tracker types with different tracking algorithms to exploit the repeated configurations of reactors, but virtual function calls have high overhead on the GPU. The Shift MC code was modified to support GPU-based tracking with three strategies: (1) dynamic polymorphism (DP) with virtual functions, (2) static polymorphism (SP), and (3) a single tracker (ST) type with tree-based acceleration. Results on the Frontier supercomputer show that the DP, SP, and ST methods achieve 77.8%, 91.2%, and 83.4% of the practical maximum tracking rate in the worst case, indicating that any of these methods can be used without incurring a significant performance penalty. The flexibility of the ST method is highlighted with a hexagonal-grid microreactor problem, performed without hexagonal-grid-specific tracking routines.
- [432] arXiv:2406.13851 [pdf, html, other]
-
Title: Optimizing Quantile-based Trading Strategies in Electricity ArbitrageSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Efficiently integrating renewable resources into electricity markets is vital for addressing the challenges of matching real-time supply and demand while reducing the significant energy wastage resulting from curtailments. To address this challenge effectively, the incorporation of storage devices can enhance the reliability and efficiency of the grid, improving market liquidity and reducing price volatility. In short-term electricity markets, participants navigate numerous options, each presenting unique challenges and opportunities, underscoring the critical role of the trading strategy in maximizing profits. This study delves into the optimization of day-ahead and balancing market trading, leveraging quantile-based forecasts. Employing three trading approaches with practical constraints, our research enhances forecast assessment, increases trading frequency, and employs flexible timestamp orders. Our findings underscore the profit potential of simultaneous participation in both day-ahead and balancing markets, especially with larger battery storage systems; despite increased costs and narrower profit margins associated with higher-volume trading, the implementation of high-frequency strategies plays a significant role in maximizing profits and addressing market challenges. Finally, we modelled four commercial battery storage systems and evaluated their economic viability through a scenario analysis, with larger batteries showing a shorter return on investment.
- [433] arXiv:2406.13853 [pdf, html, other]
-
Title: AltGeoViz: Facilitating Accessible GeovisualizationSubjects: Human-Computer Interaction (cs.HC)
Geovisualizations are powerful tools for exploratory spatial analysis, enabling sighted users to discern patterns, trends, and relationships within geographic data. However, these visual tools have remained largely inaccessible to screen-reader users. We present AltGeoViz, a new system we designed to facilitate geovisualization exploration for these users. AltGeoViz dynamically generates alt-text descriptions based on the user's current map view, providing summaries of spatial patterns and descriptive statistics. In a study of five screen-reader users, we found that AltGeoViz enabled them to interact with geovisualizations in previously infeasible ways. Participants demonstrated a clear understanding of data summaries and their location context, and they could synthesize spatial understandings of their explorations. Moreover, we identified key areas for improvement, such as the addition of intuitive spatial navigation controls and comparative analysis features.
- [434] arXiv:2406.13855 [pdf, html, other]
-
Title: Advancing Blockchain Scalability: An Introduction to Layer 1 and Layer 2 SolutionsSubjects: Cryptography and Security (cs.CR)
Bitcoin rise has put blockchain technology into the mainstream, amplifying its potential and broad utility. While Bitcoin has become incredibly famous, its transaction rate has not match such a corresponding increase. It still takes approximately 10 minutes to mine a block and add it to the chain. This limitation highlights the importance of seeking scale-up solutions that solve the low throughput transaction rates. Blockchain's consensus mechanisms make peer-to-peer transactions becomes feasible and effectively eliminate the need for centralized control. However, the decentralized systems also causes a lower speed and throughput compared to centralized networks as we mentioned Bitcoin's block creation rates. Two mainstreams scale-up solutions, Layer 1 scale-up and Layer 2 scale-up have been implemented to address these issues. Layer 1 level scalability enhancements happen at where traditional blockchain operates. This paper provides a deep examination of the components of the Layer 1 protocol and the scale-up methods that directly improve the lower level blockchain. We also address that Layer 1 solutions encounter inherent limitations although improvements were applied due to layer 1 storage costs and latency are high. In addition, we discuss layer 2 protocols, advanced scalability techniques, that elevate blockchain performance by handling transactions off the mainnet. Our findings indicate that Layer 2 protocols, with their various implementations such as rollups and channels, significantly outperform Layer 1 solutions in terms of transaction throughput and efficiency. This paper discusses these Layer 2 scaling methods in detail, aiming to provide readers with a comprehensive understanding of these protocols and the underlying logic that drives their effectiveness.
- [435] arXiv:2406.13856 [pdf, html, other]
-
Title: Kishu: Time-Traveling for Computational NotebooksSubjects: Databases (cs.DB)
Computational notebooks (e.g., Jupyter, Google Colab) are widely used by data scientists. A key feature of notebooks is the interactive computing model of iteratively executing cells (i.e., a set of statements) and observing the result (e.g., model or plot). Unfortunately, existing notebook systems do not offer time-traveling to past states: when the user executes a cell, the notebook session state consisting of user-defined variables can be irreversibly modified - e.g., the user cannot 'un-drop' a dataframe column. This is because, unlike DBMS, existing notebook systems do not keep track of the session state. Existing techniques for checkpointing and restoring session states, such as OS-level memory snapshot or application-level session dump, are insufficient: checkpointing can incur prohibitive storage costs and may fail, while restoration can only be inefficiently performed from scratch by fully loading checkpoint files.
In this paper, we introduce a new notebook system, Kishu, that offers time-traveling to and from arbitrary notebook states using an efficient and fault-tolerant incremental checkpoint and checkout mechanism. Kishu creates incremental checkpoints that are small and correctly preserve complex inter-variable dependencies at a novel Co-variable granularity. Then, to return to a previous state, Kishu accurately identifies the state difference between the current and target states to perform incremental checkout at sub-second latency with minimal data loading. Kishu is compatible with 146 object classes from popular data science libraries (e.g., Ray, Spark, PyTorch), and reduces checkpoint size and checkout time by up to 4.55x and 9.02x, respectively, on a variety of notebooks. - [436] arXiv:2406.13857 [pdf, html, other]
-
Title: Martian Exploration of Lava Tubes (MELT) with ReachBot: Scientific Investigation and Concept of OperationsJulia Di, Sara Cuevas-Quinones, Stephanie Newdick, Tony G. Chen, Marco Pavone, Mathieu G. A. Lapotre, Mark CutkoskyComments: In International Conference on Space Robotics 2024Subjects: Robotics (cs.RO)
As natural access points to the subsurface, lava tubes and other caves have become premier targets of planetary missions for astrobiological analyses. Few existing robotic paradigms, however, are able to explore such challenging environments. ReachBot is a robot that enables navigation in planetary caves by using extendable and retractable limbs to locomote. This paper outlines the potential science return and mission operations for a notional mission that deploys ReachBot to a martian lava tube. In this work, the motivating science goals and science traceability matrix are provided to guide payload selection. A Concept of Operations (ConOps) is also developed for ReachBot, providing a framework for deployment and activities on Mars, analyzing mission risks, and developing mitigation strategies
- [437] arXiv:2406.13858 [pdf, html, other]
-
Title: Distributional reasoning in LLMs: Parallel reasoning processes in multi-hop reasoningSubjects: Computation and Language (cs.CL)
Large language models (LLMs) have shown an impressive ability to perform tasks believed to require thought processes. When the model does not document an explicit thought process, it becomes difficult to understand the processes occurring within its hidden layers and to determine if these processes can be referred to as reasoning. We introduce a novel and interpretable analysis of internal multi-hop reasoning processes in LLMs. We demonstrate that the prediction process for compositional reasoning questions can be modeled using a simple linear transformation between two semantic category spaces. We show that during inference, the middle layers of the network generate highly interpretable embeddings that represent a set of potential intermediate answers for the multi-hop question. We use statistical analyses to show that a corresponding subset of tokens is activated in the model's output, implying the existence of parallel reasoning paths. These observations hold true even when the model lacks the necessary knowledge to solve the task. Our findings can help uncover the strategies that LLMs use to solve reasoning tasks, offering insights into the types of thought processes that can emerge from artificial intelligence. Finally, we also discuss the implication of cognitive modeling of these results.
- [438] arXiv:2406.13860 [pdf, html, other]
-
Title: Liveness Detection in Computer Vision: Transformer-based Self-Supervised Learning for Face Anti-SpoofingComments: This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessibleSubjects: Computer Vision and Pattern Recognition (cs.CV)
Face recognition systems are increasingly used in biometric security for convenience and effectiveness. However, they remain vulnerable to spoofing attacks, where attackers use photos, videos, or masks to impersonate legitimate users. This research addresses these vulnerabilities by exploring the Vision Transformer (ViT) architecture, fine-tuned with the DINO framework. The DINO framework facilitates self-supervised learning, enabling the model to learn distinguishing features from unlabeled data. We compared the performance of the proposed fine-tuned ViT model using the DINO framework against a traditional CNN model, EfficientNet b2, on the face anti-spoofing task. Numerous tests on standard datasets show that the ViT model performs better than the CNN model in terms of accuracy and resistance to different spoofing methods. Additionally, we collected our own dataset from a biometric application to validate our findings further. This study highlights the superior performance of transformer-based architecture in identifying complex spoofing cues, leading to significant advancements in biometric security.
- [439] arXiv:2406.13862 [pdf, html, other]
-
Title: Knowledge Graph-Enhanced Large Language Models via Path SelectionSubjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Large Language Models (LLMs) have shown unprecedented performance in various real-world applications. However, they are known to generate factually inaccurate outputs, a.k.a. the hallucination problem. In recent years, incorporating external knowledge extracted from Knowledge Graphs (KGs) has become a promising strategy to improve the factual accuracy of LLM-generated outputs. Nevertheless, most existing explorations rely on LLMs themselves to perform KG knowledge extraction, which is highly inflexible as LLMs can only provide binary judgment on whether a certain knowledge (e.g., a knowledge path in KG) should be used. In addition, LLMs tend to pick only knowledge with direct semantic relationship with the input text, while potentially useful knowledge with indirect semantics can be ignored. In this work, we propose a principled framework KELP with three stages to handle the above problems. Specifically, KELP is able to achieve finer granularity of flexible knowledge extraction by generating scores for knowledge paths with input texts via latent semantic matching. Meanwhile, knowledge paths with indirect semantic relationships with the input text can also be considered via trained encoding between the selected paths in KG and the input text. Experiments on real-world datasets validate the effectiveness of KELP.
- [440] arXiv:2406.13864 [pdf, html, other]
-
Title: Evaluating representation learning on the protein structure universeArian R. Jamasb, Alex Morehead, Chaitanya K. Joshi, Zuobai Zhang, Kieran Didi, Simon V. Mathis, Charles Harris, Jian Tang, Jianlin Cheng, Pietro Lio, Tom L. BlundellComments: ICLR 2024Subjects: Machine Learning (cs.LG); Biomolecules (q-bio.BM)
We introduce ProteinWorkshop, a comprehensive benchmark suite for representation learning on protein structures with Geometric Graph Neural Networks. We consider large-scale pre-training and downstream tasks on both experimental and predicted structures to enable the systematic evaluation of the quality of the learned structural representation and their usefulness in capturing functional relationships for downstream tasks. We find that: (1) large-scale pretraining on AlphaFold structures and auxiliary tasks consistently improve the performance of both rotation-invariant and equivariant GNNs, and (2) more expressive equivariant GNNs benefit from pretraining to a greater extent compared to invariant models. We aim to establish a common ground for the machine learning and computational biology communities to rigorously compare and advance protein structure representation learning. Our open-source codebase reduces the barrier to entry for working with large protein structure datasets by providing: (1) storage-efficient dataloaders for large-scale structural databases including AlphaFoldDB and ESM Atlas, as well as (2) utilities for constructing new tasks from the entire PDB. ProteinWorkshop is available at: this http URL.
- [441] arXiv:2406.13865 [pdf, html, other]
-
Title: SurgicAI: A Fine-grained Platform for Data Collection and Benchmarking in Surgical Policy LearningSubjects: Robotics (cs.RO)
Despite advancements in robotic-assisted surgery, automating complex tasks like suturing remain challenging due to the need for adaptability and precision. Learning-based approaches, particularly reinforcement learning (RL) and imitation learning (IL), require realistic simulation environments for efficient data collection. However, current platforms often include only relatively simple, non-dexterous manipulations and lack the flexibility required for effective learning and generalization.
We introduce SurgicAI, a novel platform for development and benchmarking addressing these challenges by providing the flexibility to accommodate both modular subtasks and more importantly task decomposition in RL-based surgical robotics. Compatible with the da Vinci Surgical System, SurgicAI offers a standardized pipeline for collecting and utilizing expert demonstrations. It supports deployment of multiple RL and IL approaches, and the training of both singular and compositional subtasks in suturing scenarios, featuring high dexterity and modularization. Meanwhile, SurgicAI sets clear metrics and benchmarks for the assessment of learned policies. We implemented and evaluated multiple RL and IL algorithms on SurgicAI. Our detailed benchmark analysis underscores SurgicAI's potential to advance policy learning in surgical robotics. Details: \url{this https URL - [442] arXiv:2406.13867 [pdf, html, other]
-
Title: Error-Correcting Graph CodesComments: 27 pages, 3 figures, 1 tableSubjects: Information Theory (cs.IT); Discrete Mathematics (cs.DM)
In this paper, we define, study, and construct {\em Error-Correcting Graph Codes}. An error-correcting graph code of distance $\delta$ is a family $C$ of graphs, on a common vertex set of size $n$, such that if we start with any graph in $C$, we would have to modify the neighborhoods of at least $\delta n$ vertices in order to reach some other graph in $C$.
This is a natural graph generalization of the standard Hamming distance error-correcting codes for binary strings. We show:
1. Combinatorial results determining the optimal rate vs distance tradeoff nonconstructively.
2. A connection to rank-metric codes, enabling some simple and some involved constructions achieving certain positive rates and distances.
3. Graph code analogues of Reed-Solomon codes and code concatenation, leading to positive distance codes for all rates and positive rate codes for all distances.
4. Graph code analogues of dual-BCH codes, yielding large codes with distance $\delta = 1-o(1)$. This gives an explicit "graph code of Ramsey graphs".
Several recent works, starting with the paper of Alon, Gujgiczer, Körner, Milojević, and Simonyi, have studied more general graph codes; where the symmetric difference between any two graphs in the code is required to have a desired property. Error-correcting graph codes are a particularly interesting instantiation of this concept. - [443] arXiv:2406.13868 [pdf, html, other]
-
Title: SDQ: Sparse Decomposed Quantization for LLM InferenceComments: PreprintSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Recently, large language models (LLMs) have shown surprising performance in task-specific workloads as well as general tasks with the given prompts. However, to achieve unprecedented performance, recent LLMs use billions to trillions of parameters, which hinder the wide adaptation of those models due to their extremely large compute and memory requirements. To resolve the issue, various model compression methods are being actively investigated. In this work, we propose SDQ (Sparse Decomposed Quantization) to exploit both structured sparsity and quantization to achieve both high compute and memory efficiency. From our evaluations, we observe that SDQ can achieve 4x effective compute throughput with <1% quality drop.
- [444] arXiv:2406.13869 [pdf, html, other]
-
Title: Global Human-guided Counterfactual Explanations for Molecular Properties via Reinforcement LearningDanqing Wang, Antonis Antoniades, Kha-Dinh Luong, Edwin Zhang, Mert Kosan, Jiachen Li, Ambuj Singh, William Yang Wang, Lei LiComments: Accepted by KDD 2024Subjects: Machine Learning (cs.LG); Biomolecules (q-bio.BM)
Counterfactual explanations of Graph Neural Networks (GNNs) offer a powerful way to understand data that can naturally be represented by a graph structure. Furthermore, in many domains, it is highly desirable to derive data-driven global explanations or rules that can better explain the high-level properties of the models and data in question. However, evaluating global counterfactual explanations is hard in real-world datasets due to a lack of human-annotated ground truth, which limits their use in areas like molecular sciences. Additionally, the increasing scale of these datasets provides a challenge for random search-based methods. In this paper, we develop a novel global explanation model RLHEX for molecular property prediction. It aligns the counterfactual explanations with human-defined principles, making the explanations more interpretable and easy for experts to evaluate. RLHEX includes a VAE-based graph generator to generate global explanations and an adapter to adjust the latent representation space to human-defined principles. Optimized by Proximal Policy Optimization (PPO), the global explanations produced by RLHEX cover 4.12% more input graphs and reduce the distance between the counterfactual explanation set and the input set by 0.47% on average across three molecular datasets. RLHEX provides a flexible framework to incorporate different human-designed principles into the counterfactual explanation generation process, aligning these explanations with domain expertise. The code and data are released at this https URL.
- [445] arXiv:2406.13870 [pdf, html, other]
-
Title: Splatter a Video: Video Gaussian Representation for Versatile ProcessingSubjects: Computer Vision and Pattern Recognition (cs.CV)
Video representation is a long-standing problem that is crucial for various down-stream tasks, such as tracking,depth prediction,segmentation,view synthesis,and editing. However, current methods either struggle to model complex motions due to the absence of 3D structure or rely on implicit 3D representations that are ill-suited for manipulation tasks. To address these challenges, we introduce a novel explicit 3D representation-video Gaussian representation -- that embeds a video into 3D Gaussians. Our proposed representation models video appearance in a 3D canonical space using explicit Gaussians as proxies and associates each Gaussian with 3D motions for video motion. This approach offers a more intrinsic and explicit representation than layered atlas or volumetric pixel matrices. To obtain such a representation, we distill 2D priors, such as optical flow and depth, from foundation models to regularize learning in this ill-posed setting. Extensive applications demonstrate the versatility of our new video representation. It has been proven effective in numerous video processing tasks, including tracking, consistent video depth and feature refinement, motion and appearance editing, and stereoscopic video generation. Project page: this https URL
- [446] arXiv:2406.13871 [pdf, html, other]
-
Title: Robust Time Series Forecasting with Non-Heavy-Tailed Gaussian Loss-Weighted SamplerComments: 8 pagesSubjects: Machine Learning (cs.LG)
Forecasting multivariate time series is a computationally intensive task challenged by extreme or redundant samples. Recent resampling methods aim to increase training efficiency by reweighting samples based on their running losses. However, these methods do not solve the problems caused by heavy-tailed distribution losses, such as overfitting to outliers. To tackle these issues, we introduce a novel approach: a Gaussian loss-weighted sampler that multiplies their running losses with a Gaussian distribution weight. It reduces the probability of selecting samples with very low or very high losses while favoring those close to average losses. As it creates a weighted loss distribution that is not heavy-tailed theoretically, there are several advantages to highlight compared to existing methods: 1) it relieves the inefficiency in learning redundant easy samples and overfitting to outliers, 2) It improves training efficiency by preferentially learning samples close to the average loss. Application on real-world time series forecasting datasets demonstrate improvements in prediction quality for 1%-4% using mean square error measurements in channel-independent settings. The code will be available online after 1 the review.
- [447] arXiv:2406.13872 [pdf, html, other]
-
Title: Least SQuares Discretizations (LSQD): a robust and versatile meshless paradigm for solving elliptic PDEsSubjects: Numerical Analysis (math.NA)
Searching for numerical methods that combine facility and efficiency, while remaining accurate and versatile, is critical. Often, irregular geometries challenge traditional methods that rely on structured or body-fitted meshes. Meshless methods mitigate these issues but oftentimes require the weak formulation which involves defining quadrature rules over potentially intricate geometries. To overcome these challenges, we propose the Least Squares Discretization (LSQD) method. This novel approach simplifies the application of meshless methods by eliminating the need for a weak formulation and necessitates minimal numerical analysis. It offers significant advantages in terms of ease of implementation and adaptability to complex geometries. In this paper, we demonstrate the efficacy of the LSQD method in solving elliptic partial differential equations for a variety of boundary conditions, geometries, and data layouts. We monitor h-P convergence across these parameters and construct an a posteriori built-in error estimator to establish our method as a robust and accessible numerical alternative.
- [448] arXiv:2406.13873 [pdf, html, other]
-
Title: A Pure Transformer Pretraining Framework on Text-attributed GraphsYu Song, Haitao Mao, Jiachen Xiao, Jingzhe Liu, Zhikai Chen, Wei Jin, Carl Yang, Jiliang Tang, Hui LiuSubjects: Artificial Intelligence (cs.AI)
Pretraining plays a pivotal role in acquiring generalized knowledge from large-scale data, achieving remarkable successes as evidenced by large models in CV and NLP. However, progress in the graph domain remains limited due to fundamental challenges such as feature heterogeneity and structural heterogeneity. Recently, increasing efforts have been made to enhance node feature quality with Large Language Models (LLMs) on text-attributed graphs (TAGs), demonstrating superiority to traditional bag-of-words or word2vec techniques. These high-quality node features reduce the previously critical role of graph structure, resulting in a modest performance gap between Graph Neural Networks (GNNs) and structure-agnostic Multi-Layer Perceptrons (MLPs). Motivated by this, we introduce a feature-centric pretraining perspective by treating graph structure as a prior and leveraging the rich, unified feature space to learn refined interaction patterns that generalizes across graphs. Our framework, Graph Sequence Pretraining with Transformer (GSPT), samples node contexts through random walks and employs masked feature reconstruction to capture pairwise proximity in the LLM-unified feature space using a standard Transformer. By utilizing unified text representations rather than varying structures, our framework achieves significantly better transferability among graphs within the same domain. GSPT can be easily adapted to both node classification and link prediction, demonstrating promising empirical success on various datasets.
- [449] arXiv:2406.13875 [pdf, html, other]
-
Title: WATT: Weight Average Test-Time Adaption of CLIPDavid Osowiechi, Mehrdad Noori, Gustavo Adolfo Vargas Hakim, Moslem Yazdanpanah, Ali Bahri, Milad Cheraghalikhani, Sahar Dastani, Farzad Beizaee, Ismail Ben Ayed, Christian DesrosiersSubjects: Computer Vision and Pattern Recognition (cs.CV)
Vision-Language Models (VLMs) such as CLIP have yielded unprecedented performance for zero-shot image classification, yet their generalization capability may still be seriously challenged when confronted to domain shifts. In response, we present Weight Average Test-Time Adaptation (WATT) of CLIP, a pioneering approach facilitating full test-time adaptation (TTA) of this VLM. Our method employs a diverse set of templates for text prompts, augmenting the existing framework of CLIP. Predictions are utilized as pseudo labels for model updates, followed by weight averaging to consolidate the learned information globally. Furthermore, we introduce a text ensemble strategy, enhancing overall test performance by aggregating diverse textual cues. Our findings underscore the efficacy of WATT in enhancing performance across diverse datasets, including CIFAR-10-C, CIFAR-10.1, CIFAR-100-C, VisDA-C, and several other challenging datasets, effectively covering a wide range of domain shifts. Notably, these enhancements are achieved without necessitating additional model transformations or trainable modules. Moreover, compared to other Test-Time Adaptation methods, our approach can operate effectively with just a single image. Highlighting the potential of innovative test-time strategies, this research emphasizes their role in fortifying the adaptability of VLMs. The implementation is available at: \url{this https URL}.
- [450] arXiv:2406.13877 [pdf, other]
-
Title: A Systematic Literature Review on the Use of Machine Learning in Software EngineeringComments: 29 pages, 1 figure, 1 tableSubjects: Software Engineering (cs.SE); Machine Learning (cs.LG)
Software engineering (SE) is a dynamic field that involves multiple phases all of which are necessary to develop sustainable software systems. Machine learning (ML), a branch of artificial intelligence (AI), has drawn a lot of attention in recent years thanks to its ability to analyze massive volumes of data and extract useful patterns from data. Several studies have focused on examining, categorising, and assessing the application of ML in SE processes. We conducted a literature review on primary studies to address this gap. The study was carried out following the objective and the research questions to explore the current state of the art in applying machine learning techniques in software engineering processes. The review identifies the key areas within software engineering where ML has been applied, including software quality assurance, software maintenance, software comprehension, and software documentation. It also highlights the specific ML techniques that have been leveraged in these domains, such as supervised learning, unsupervised learning, and deep learning.
Keywords: machine learning, deep learning, software engineering, natural language processing, source code - [451] arXiv:2406.13880 [pdf, html, other]
-
Title: Privacy-Preserving ECG Data Analysis with Differential Privacy: A Literature Review and A Case StudySubjects: Cryptography and Security (cs.CR)
Differential privacy has become the preeminent technique to protect the privacy of individuals in a database while allowing useful results from data analysis to be shared. Notably, it guarantees the amount of privacy loss in the worst-case scenario. Although many theoretical research papers have been published, practical real-life application of differential privacy demands estimating several important parameters without any clear solutions or guidelines. In the first part of the paper, we provide an overview of key concepts in differential privacy, followed by a literature review and discussion of its application to ECG analysis. In the second part of the paper, we explore how to implement differentially private query release on an arrhythmia database using a six-step process. We provide guidelines and discuss the related literature for all the steps involved, such as selection of the $\epsilon$ value, distribution of the total $\epsilon$ budget across the queries, and estimation of the sensitivity for the query functions. At the end, we discuss the shortcomings and challenges of applying differential privacy to ECG datasets.
- [452] arXiv:2406.13881 [pdf, html, other]
-
Title: Static Generation of Efficient OpenMP Offload Data MappingsComments: Accepted to the 2024 International Conference for High Performance Computing, Networking, Storage, and Analysis (SC24)Subjects: Distributed, Parallel, and Cluster Computing (cs.DC)
Increasing heterogeneity in HPC architectures and compiler advancements have led to OpenMP being frequently used to enable computations on heterogeneous devices. However, the efficient movement of data on heterogeneous computing platforms is crucial for achieving high utilization. The implicit OpenMP data-mapping rules often result in redundant data transfer, which can be a bottleneck for program performance. Programmers must explicitly map data between the host and connected accelerator devices to achieve efficient data movement. For this, OpenMP offers the target data and target update constructs. Ensuring efficient data transfer requires programmers to reason about complex data flow. This can be a laborious and error-prone process since the programmer must keep a mental model of data validity and lifetime spanning multiple data environments. Any automated analysis should maximize data reuse, minimize data transfer, and must consider control flow and context from function call sites, making the analysis interprocedural and context sensitive. In this paper, we present a static analysis tool, OMPDart (OpenMP DAta Reduction Tool), for OpenMP programs that models data dependencies between host and device regions and applies source code transformations to achieve efficient data transfer. The analysis is based on a hybrid data structure that joins an Abstract Syntax Tree (AST) with a Control Flow Graph (CFG). Our evaluations on nine HPC benchmarks demonstrate that OMPDart is capable of generating effective data mapping constructs that substantially reduce data transfer between host and device. OMPDart helps reduce data transfers by 85% and improves runtime performance by 1.6x over an expert-defined implementation of LULESH 2.0.
- [453] arXiv:2406.13882 [pdf, other]
-
Title: Allocation Requires Prediction Only if Inequality Is LowComments: Appeared in Forty-first International Conference on Machine Learning (ICML), 2024Subjects: Machine Learning (cs.LG); Computers and Society (cs.CY); Theoretical Economics (econ.TH)
Algorithmic predictions are emerging as a promising solution concept for efficiently allocating societal resources. Fueling their use is an underlying assumption that such systems are necessary to identify individuals for interventions. We propose a principled framework for assessing this assumption: Using a simple mathematical model, we evaluate the efficacy of prediction-based allocations in settings where individuals belong to larger units such as hospitals, neighborhoods, or schools. We find that prediction-based allocations outperform baseline methods using aggregate unit-level statistics only when between-unit inequality is low and the intervention budget is high. Our results hold for a wide range of settings for the price of prediction, treatment effect heterogeneity, and unit-level statistics' learnability. Combined, we highlight the potential limits to improving the efficacy of interventions through prediction.
- [454] arXiv:2406.13883 [pdf, html, other]
-
Title: We Are The Clouds: Blending Interaction and Participation in Urban Media ArtSubjects: Human-Computer Interaction (cs.HC); Computers and Society (cs.CY)
Since the early 2000s, cultural institutions have been instrumental in reshaping public spaces, fostering community engagement, and nurturing artistic innovation. Central to these initiatives are audience interaction and participation concepts, yet their definitions and applications in urban media art remain nebulous. This article endeavours to demystify these terms, examining the distinct characteristics and intersections of interactive and participatory art within urban contexts. A particular emphasis is placed on artworks that harmonise both elements, exploring the motivations and outcomes of this synthesis. The case study of We Are The Clouds serves as a focal point, exemplifying how strategic integration of interaction and participation can enhance community connection and reinvigorate public spaces. Through this analysis, the paper underscores the transformative power of urban media artworks in redefining neighbourhood experiences, empowering local voices, and revitalising the essence of public realms.
- [455] arXiv:2406.13885 [pdf, html, other]
-
Title: Knowledge Tagging System on Math Questions via LLMs with Flexible Demonstration RetrieverComments: 13 pages, 6 figuresSubjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Knowledge tagging for questions plays a crucial role in contemporary intelligent educational applications, including learning progress diagnosis, practice question recommendations, and course content organization. Traditionally, these annotations are always conducted by pedagogical experts, as the task requires not only a strong semantic understanding of both question stems and knowledge definitions but also deep insights into connecting question-solving logic with corresponding knowledge concepts. With the recent emergence of advanced text encoding algorithms, such as pre-trained language models, many researchers have developed automatic knowledge tagging systems based on calculating the semantic similarity between the knowledge and question embeddings. In this paper, we explore automating the task using Large Language Models (LLMs), in response to the inability of prior encoding-based methods to deal with the hard cases which involve strong domain knowledge and complicated concept definitions. By showing the strong performance of zero- and few-shot results over math questions knowledge tagging tasks, we demonstrate LLMs' great potential in conquering the challenges faced by prior methods. Furthermore, by proposing a reinforcement learning-based demonstration retriever, we successfully exploit the great potential of different-sized LLMs in achieving better performance results while keeping the in-context demonstration usage efficiency high.
- [456] arXiv:2406.13890 [pdf, html, other]
-
Title: ClinicalLab: Aligning Agents for Multi-Departmental Clinical Diagnostics in the Real WorldWeixiang Yan, Haitian Liu, Tengxiao Wu, Qian Chen, Wen Wang, Haoyuan Chai, Jiayi Wang, Weishan Zhao, Yixin Zhang, Renjun Zhang, Li ZhuSubjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
LLMs have achieved significant performance progress in various NLP applications. However, LLMs still struggle to meet the strict requirements for accuracy and reliability in the medical field and face many challenges in clinical applications. Existing clinical diagnostic evaluation benchmarks for evaluating medical agents powered by LLMs have severe limitations. Firstly, most existing medical evaluation benchmarks face the risk of data leakage or contamination. Secondly, existing benchmarks often neglect the characteristics of multiple departments and specializations in modern medical practice. Thirdly, existing evaluation methods are limited to multiple-choice questions, which do not align with the real-world diagnostic scenarios. Lastly, existing evaluation methods lack comprehensive evaluations of end-to-end real clinical scenarios. These limitations in benchmarks in turn obstruct advancements of LLMs and agents for medicine. To address these limitations, we introduce ClinicalLab, a comprehensive clinical diagnosis agent alignment suite. ClinicalLab includes ClinicalBench, an end-to-end multi-departmental clinical diagnostic evaluation benchmark for evaluating medical agents and LLMs. ClinicalBench is based on real cases that cover 24 departments and 150 diseases. ClinicalLab also includes four novel metrics (ClinicalMetrics) for evaluating the effectiveness of LLMs in clinical diagnostic tasks. We evaluate 17 LLMs and find that their performance varies significantly across different departments. Based on these findings, in ClinicalLab, we propose ClinicalAgent, an end-to-end clinical agent that aligns with real-world clinical diagnostic practices. We systematically investigate the performance and applicable scenarios of variants of ClinicalAgent on ClinicalBench. Our findings demonstrate the importance of aligning with modern medical practices in designing medical agents.
- [457] arXiv:2406.13891 [pdf, html, other]
-
Title: DPO: Dual-Perturbation Optimization for Test-time Adaptation in 3D Object DetectionComments: 13 pages, 7 figuresSubjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
LiDAR-based 3D object detection has seen impressive advances in recent times. However, deploying trained 3D detectors in the real world often yields unsatisfactory performance when the distribution of the test data significantly deviates from the training data due to different weather conditions, object sizes, \textit{etc}. A key factor in this performance degradation is the diminished generalizability of pre-trained models, which creates a sharp loss landscape during training. Such sharpness, when encountered during testing, can precipitate significant performance declines, even with minor data variations. To address the aforementioned challenges, we propose \textbf{dual-perturbation optimization (DPO)} for \textbf{\underline{T}est-\underline{t}ime \underline{A}daptation in \underline{3}D \underline{O}bject \underline{D}etection (TTA-3OD)}. We minimize the sharpness to cultivate a flat loss landscape to ensure model resiliency to minor data variations, thereby enhancing the generalization of the adaptation process. To fully capture the inherent variability of the test point clouds, we further introduce adversarial perturbation to the input BEV features to better simulate the noisy test environment. As the dual perturbation strategy relies on trustworthy supervision signals, we utilize a reliable Hungarian matcher to filter out pseudo-labels sensitive to perturbations. Additionally, we introduce early Hungarian cutoff to avoid error accumulation from incorrect pseudo-labels by halting the adaptation process. Extensive experiments across three types of transfer tasks demonstrate that the proposed DPO significantly surpasses previous state-of-the-art approaches, specifically on Waymo $\rightarrow$ KITTI, outperforming the most competitive baseline by 57.72\% in $\text{AP}_\text{3D}$ and reaching 91\% of the fully supervised upper bound.
- [458] arXiv:2406.13892 [pdf, html, other]
-
Title: Adaptable Logical Control for Large Language ModelsSubjects: Computation and Language (cs.CL)
Despite the success of Large Language Models (LLMs) on various tasks following human instructions, controlling model generation at inference time poses a persistent challenge. In this paper, we introduce Ctrl-G, an adaptable framework that facilitates tractable and flexible control of LLM generation to reliably follow logical constraints. Ctrl-G combines any production-ready LLM with a Hidden Markov Model, enabling LLM outputs to adhere to logical constraints represented as deterministic finite automata. We show that Ctrl-G, when applied to a TULU2-7B model, outperforms GPT3.5 and GPT4 on the task of interactive text editing: specifically, for the task of generating text insertions/continuations following logical constraints, Ctrl-G achieves over 30% higher satisfaction rate in human evaluation compared to GPT4. When applied to medium-size language models (e.g., GPT2-large), Ctrl-G also beats its counterparts for constrained generation by large margins on standard benchmarks. Additionally, as a proof-of-concept study, we experiment Ctrl-G on the Grade School Math benchmark to assist LLM reasoning, foreshadowing the application of Ctrl-G, as well as other constrained generation approaches, beyond traditional language generation tasks.
- [459] arXiv:2406.13893 [pdf, html, other]
-
Title: Open Generative Large Language Models for GalicianPablo Gamallo, Pablo Rodríguez, Iria de-Dios-Flores, Susana Sotelo, Silvia Paniagua, Daniel Bardanca, José Ramom Pichel, Marcos GarciaComments: 12 pages, 1 figureSubjects: Computation and Language (cs.CL)
Large language models (LLMs) have transformed natural language processing. Yet, their predominantly English-centric training has led to biases and performance disparities across languages. This imbalance marginalizes minoritized languages, making equitable access to NLP technologies more difficult for languages with lower resources, such as Galician. We present the first two generative LLMs focused on Galician to bridge this gap. These models, freely available as open-source resources, were trained using a GPT architecture with 1.3B parameters on a corpus of 2.1B words. Leveraging continual pretraining, we adapt to Galician two existing LLMs trained on larger corpora, thus mitigating the data constraints that would arise if the training were performed from scratch. The models were evaluated using human judgments and task-based datasets from standardized benchmarks. These evaluations reveal a promising performance, underscoring the importance of linguistic diversity in generative models.
- [460] arXiv:2406.13894 [pdf, other]
-
Title: Using Multimodal Large Language Models for Automated Detection of Traffic Safety Critical EventsSubjects: Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY)
Traditional approaches to safety event analysis in autonomous systems have relied on complex machine learning models and extensive datasets for high accuracy and reliability. However, the advent of Multimodal Large Language Models (MLLMs) offers a novel approach by integrating textual, visual, and audio modalities, thereby providing automated analyses of driving videos. Our framework leverages the reasoning power of MLLMs, directing their output through context-specific prompts to ensure accurate, reliable, and actionable insights for hazard detection. By incorporating models like Gemini-Pro-Vision 1.5 and Llava, our methodology aims to automate the safety critical events and mitigate common issues such as hallucinations in MLLM outputs. Preliminary results demonstrate the framework's potential in zero-shot learning and accurate scenario analysis, though further validation on larger datasets is necessary. Furthermore, more investigations are required to explore the performance enhancements of the proposed framework through few-shot learning and fine-tuned models. This research underscores the significance of MLLMs in advancing the analysis of the naturalistic driving videos by improving safety-critical event detecting and understanding the interaction with complex environments.
- [461] arXiv:2406.13896 [pdf, html, other]
-
Title: Simultaneous Map and Object ReconstructionSubjects: Computer Vision and Pattern Recognition (cs.CV)
In this paper, we present a method for dynamic surface reconstruction of large-scale urban scenes from LiDAR. Depth-based reconstructions tend to focus on small-scale objects or large-scale SLAM reconstructions that treat moving objects as outliers. We take a holistic perspective and optimize a compositional model of a dynamic scene that decomposes the world into rigidly moving objects and the background. To achieve this, we take inspiration from recent novel view synthesis methods and pose the reconstruction problem as a global optimization, minimizing the distance between our predicted surface and the input LiDAR scans. We show how this global optimization can be decomposed into registration and surface reconstruction steps, which are handled well by off-the-shelf methods without any re-training. By careful modeling of continuous-time motion, our reconstructions can compensate for the rolling shutter effects of rotating LiDAR sensors. This allows for the first system (to our knowledge) that properly motion compensates LiDAR scans for rigidly-moving objects, complementing widely-used techniques for motion compensation of static scenes. Beyond pursuing dynamic reconstruction as a goal in and of itself, we also show that such a system can be used to auto-label partially annotated sequences and produce ground truth annotation for hard-to-label problems such as depth completion and scene flow.
- [462] arXiv:2406.13897 [pdf, html, other]
-
Title: CLAY: A Controllable Large-scale Generative Model for Creating High-quality 3D AssetsLongwen Zhang, Ziyu Wang, Qixuan Zhang, Qiwei Qiu, Anqi Pang, Haoran Jiang, Wei Yang, Lan Xu, Jingyi YuSubjects: Computer Vision and Pattern Recognition (cs.CV)
In the realm of digital creativity, our potential to craft intricate 3D worlds from imagination is often hampered by the limitations of existing digital tools, which demand extensive expertise and efforts. To narrow this disparity, we introduce CLAY, a 3D geometry and material generator designed to effortlessly transform human imagination into intricate 3D digital structures. CLAY supports classic text or image inputs as well as 3D-aware controls from diverse primitives (multi-view images, voxels, bounding boxes, point clouds, implicit representations, etc). At its core is a large-scale generative model composed of a multi-resolution Variational Autoencoder (VAE) and a minimalistic latent Diffusion Transformer (DiT), to extract rich 3D priors directly from a diverse range of 3D geometries. Specifically, it adopts neural fields to represent continuous and complete surfaces and uses a geometry generative module with pure transformer blocks in latent space. We present a progressive training scheme to train CLAY on an ultra large 3D model dataset obtained through a carefully designed processing pipeline, resulting in a 3D native geometry generator with 1.5 billion parameters. For appearance generation, CLAY sets out to produce physically-based rendering (PBR) textures by employing a multi-view material diffusion model that can generate 2K resolution textures with diffuse, roughness, and metallic modalities. We demonstrate using CLAY for a range of controllable 3D asset creations, from sketchy conceptual designs to production ready assets with intricate details. Even first time users can easily use CLAY to bring their vivid 3D imaginations to life, unleashing unlimited creativity.
- [463] arXiv:2406.13898 [pdf, other]
-
Title: The Use of Multimodal Large Language Models to Detect Objects from Thermal Images: Transportation ApplicationsSubjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Computers and Society (cs.CY)
The integration of thermal imaging data with Multimodal Large Language Models (MLLMs) constitutes an exciting opportunity for improving the safety and functionality of autonomous driving systems and many Intelligent Transportation Systems (ITS) applications. This study investigates whether MLLMs can understand complex images from RGB and thermal cameras and detect objects directly. Our goals were to 1) assess the ability of the MLLM to learn from information from various sets, 2) detect objects and identify elements in thermal cameras, 3) determine whether two independent modality images show the same scene, and 4) learn all objects using different modalities. The findings showed that both GPT-4 and Gemini were effective in detecting and classifying objects in thermal images. Similarly, the Mean Absolute Percentage Error (MAPE) for pedestrian classification was 70.39% and 81.48%, respectively. Moreover, the MAPE for bike, car, and motorcycle detection were 78.4%, 55.81%, and 96.15%, respectively. Gemini produced MAPE of 66.53%, 59.35% and 78.18% respectively. This finding further demonstrates that MLLM can identify thermal images and can be employed in advanced imaging automation technologies for ITS applications.
- [464] arXiv:2406.13903 [pdf, html, other]
-
Title: Generative AI for Enhancing Active Learning in Education: A Comparative Study of GPT-3.5 and GPT-4 in Crafting Customized Test QuestionsComments: Publisher: Canadian Artificial Intelligence Association. URL: this https URLJournal-ref: Proceedings of The 37th Canadian Conference on Artificial Intelligence, 2024Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
This study investigates how LLMs, specifically GPT-3.5 and GPT-4, can develop tailored questions for Grade 9 math, aligning with active learning principles. By utilizing an iterative method, these models adjust questions based on difficulty and content, responding to feedback from a simulated 'student' model. A novel aspect of the research involved using GPT-4 as a 'teacher' to create complex questions, with GPT-3.5 as the 'student' responding to these challenges. This setup mirrors active learning, promoting deeper engagement. The findings demonstrate GPT-4's superior ability to generate precise, challenging questions and notable improvements in GPT-3.5's ability to handle more complex problems after receiving instruction from GPT-4. These results underscore the potential of LLMs to mimic and enhance active learning scenarios, offering a promising path for AI in customized education. This research contributes to understanding how AI can support personalized learning experiences, highlighting the need for further exploration in various educational contexts
- [465] arXiv:2406.13904 [pdf, html, other]
-
Title: Fitting micro-kinetic models to transient kinetics of temporal analysis of product reactors using kinetics-informed neural networksComments: 18 pages main, 13 pages SI, 16 figures, 5 tablesSubjects: Computational Engineering, Finance, and Science (cs.CE)
The temporal analysis of products (TAP) technique produces extensive transient kinetic data sets, but it is challenging to translate the large quantity of raw data into physically interpretable kinetic models, largely due to the computational scaling of existing numerical methods for fitting TAP data. In this work, we utilize kinetics-informed neural networks (KINNs), which are artificial feedforward neural networks designed to solve ordinary differential equations constrained by micro-kinetic models, to model the TAP data. We demonstrate that, under the assumption that all concentrations are known in the thin catalyst zone, KINNs can simultaneously fit the transient data, retrieve the kinetic model parameters, and interpolate unseen pulse behavior for multi-pulse experiments. We further demonstrate that, by modifying the loss function, KINNs maintain these capabilities even when precise thin-zone information is unavailable, as would be the case with real experimental TAP data. We also compare the approach to existing optimization techniques, which reveals improved noise tolerance and performance in extracting kinetic parameters. The KINNs approach offers an efficient alternative for TAP analysis and can assist in interpreting transient kinetics in complex systems over long timescales.
- [466] arXiv:2406.13905 [pdf, html, other]
-
Title: Persuasiveness of Generated Free-Text Rationales in Subjective Decisions: A Case Study on Pairwise Argument RankingSubjects: Computation and Language (cs.CL)
Generating free-text rationales is among the emergent capabilities of Large Language Models (LLMs). These rationales have been found to enhance LLM performance across various NLP tasks. Recently, there has been growing interest in using these rationales to provide insights for various important downstream tasks. In this paper, we analyze generated free-text rationales in tasks with subjective answers, emphasizing the importance of rationalization in such scenarios. We focus on pairwise argument ranking, a highly subjective task with significant potential for real-world applications, such as debate assistance. We evaluate the persuasiveness of rationales generated by nine LLMs to support their subjective choices. Our findings suggest that open-source LLMs, particularly Llama2-70B-chat, are capable of providing highly persuasive rationalizations, surpassing even GPT models. Additionally, our experiments show that rationale persuasiveness can be improved by controlling its parameters through prompting or through self-refinement.
- [467] arXiv:2406.13908 [pdf, html, other]
-
Title: A Decision-Making GPT Model Augmented with Entropy Regularization for Autonomous VehiclesSubjects: Robotics (cs.RO)
In the domain of autonomous vehicles (AVs), decision-making is a critical factor that significantly influences the efficacy of autonomous navigation. As the field progresses, the enhancement of decision-making capabilities in complex environments has become a central area of research within data-driven methodologies. Despite notable advances, existing learning-based decision-making strategies in autonomous vehicles continue to reveal opportunities for further refinement, particularly in the articulation of policies and the assurance of safety. In this study, the decision-making challenges associated with autonomous vehicles are conceptualized through the framework of the Constrained Markov Decision Process (CMDP) and approached as a sequence modeling problem. Utilizing the Generative Pre-trained Transformer (GPT), we introduce a novel decision-making model tailored for AVs, which incorporates entropy regularization techniques to bolster exploration and enhance safety performance. Comprehensive experiments conducted across various scenarios affirm that our approach surpasses several established baseline methods, particularly in terms of safety and overall efficacy.
- [468] arXiv:2406.13909 [pdf, html, other]
-
Title: Beyond Optimism: Exploration With Partially Observable RewardsSubjects: Machine Learning (cs.LG)
Exploration in reinforcement learning (RL) remains an open challenge. RL algorithms rely on observing rewards to train the agent, and if informative rewards are sparse the agent learns slowly or may not learn at all. To improve exploration and reward discovery, popular algorithms rely on optimism. But what if sometimes rewards are unobservable, e.g., situations of partial monitoring in bandits and the recent formalism of monitored Markov decision process? In this case, optimism can lead to suboptimal behavior that does not explore further to collapse uncertainty. With this paper, we present a novel exploration strategy that overcomes the limitations of existing methods and guarantees convergence to an optimal policy even when rewards are not always observable. We further propose a collection of tabular environments for benchmarking exploration in RL (with and without unobservable rewards) and show that our method outperforms existing ones.
- [469] arXiv:2406.13910 [pdf, html, other]
-
Title: A-OctoMap: An Adaptive OctoMap for Online Motion PlanningComments: 8 pages, 6 figuresSubjects: Robotics (cs.RO); Graphics (cs.GR)
Traditional robotic motion planning methods often struggle with fixed resolutions in dynamically changing environments. To address these challenges, we introduce the A-OctoMap, an adaptive Octo-Tree structure that enhances spatial representation and facilitates real-time, efficient motion planning. This novel framework allows for dynamic space partitioning and multi-resolution queries, significantly improving computational efficiency and precision. Key innovations include a tree-based data structure for enhanced geometric processing, real-time map updating for accurate trajectory planning, and efficient collision detection. Our extensive testing shows superior navigation safety and efficiency in complex settings compared to conventional methods. A-OctoMap sets a new standard for adaptive spatial mapping in autonomous systems, promising significant advancements in navigating unpredictable environments.
- [470] arXiv:2406.13911 [pdf, html, other]
-
Title: Planning Against a Prophet: A Graph-Theoretic Framework for Making Sequential DecisionsSubjects: Computer Science and Game Theory (cs.GT)
We devise a general graph-theoretic framework for studying prophet inequalities. In this framework, an agent traverses a directed acyclic graph from a starting node $s$ to a target node $t$. Each edge has a value that is sampled from a known distribution. When the agent reaches a node $v$ it observes the realized values of all the outgoing edges from $v$. The agent's objective is to maximize the expected total value of the path it takes. As in prophet inequalities, we compare the agent's performance against a prophet who observes all the realizations of the edges' values ahead of time. Our analysis reveals that this ratio highly depends on the number of paths $k$ required to cover all the nodes in the graph. In particular, we provide an algorithm that guarantees a prophet inequality ratio of $\frac{1}{2k}$ and show an upper bound of $\frac{1}{k+1}$.
Our framework captures planning problems in which there is uncertainty regarding the costs/benefits of each action. In particular, it captures an over-time variant of the classic prophet inequality in which a seller leases a durable item, such as an apartment, for $n$ time units. Each period a lessee appears and may have a different value for each lease term. We obtain a tight bound of $1/2$ for this variant. To make this framework even more expressive, we further generalize it to accommodate correlations between edges originating from the same node and allow for additional constraints on the edges the agent can take. The generalized framework captures many well-studied prophet inequality problems, including $d$-dimensional matching, $k$-prophet inequality, and more. - [471] arXiv:2406.13912 [pdf, html, other]
-
Title: From Descriptive Richness to Bias: Unveiling the Dark Side of Generative Image Caption EnrichmentSubjects: Computer Vision and Pattern Recognition (cs.CV)
Large language models (LLMs) have enhanced the capacity of vision-language models to caption visual text. This generative approach to image caption enrichment further makes textual captions more descriptive, improving alignment with the visual context. However, while many studies focus on benefits of generative caption enrichment (GCE), are there any negative side effects? We compare standard-format captions and recent GCE processes from the perspectives of "gender bias" and "hallucination", showing that enriched captions suffer from increased gender bias and hallucination. Furthermore, models trained on these enriched captions amplify gender bias by an average of 30.9% and increase hallucination by 59.5%. This study serves as a caution against the trend of making captions more descriptive.
- [472] arXiv:2406.13918 [pdf, html, other]
-
Title: Are We There Yet? Unravelling Usability Challenges and Opportunities in Collaborative Immersive Analytics for Domain ExpertsComments: Accepted in 26th International Conference on Human-Computer Interaction, HCII 2024, Washington, DC, USASubjects: Human-Computer Interaction (cs.HC)
In the ever-evolving discipline of high-dimensional scientific data, collaborative immersive analytics (CIA) offers a promising frontier for domain experts in complex data visualization and interpretation. This research presents a comprehensive framework for conducting usability studies on the extended reality (XR) interface of ParaView, an open-source CIA system. By employing established human-computer interaction (HCI) principles, including Jakob Nielsen's Usability Heuristics, Cognitive Load Theory, NASA Task Load Index, System Usability Scale, Affordance Theory, and Gulf of Execution and Evaluation, this study aims to identify underlying usability issues and provide guidelines for enhancing user experience in scientific domains. Our findings reveal significant usability challenges in the ParaView XR interface that impede effective teamwork and collaboration. For instance, the lack of synchronous collaboration, limited communication methods, and the absence of role-based data access are critical areas that need attention. Additionally, inadequate error handling, insufficient feedback mechanisms, and limited support resources during application use require extensive improvement to fully utilize the system's potential. Our study suggests potential improvements to overcome the existing usability barriers of the collaborative immersive system.
- [473] arXiv:2406.13919 [pdf, html, other]
-
Title: SPL: A Socratic Playground for Learning Powered by Large Language ModeSubjects: Artificial Intelligence (cs.AI)
Dialogue-based Intelligent Tutoring Systems (ITSs) have significantly advanced adaptive and personalized learning by automating sophisticated human tutoring strategies within interactive dialogues. However, replicating the nuanced patterns of expert human communication remains a challenge in Natural Language Processing (NLP). Recent advancements in NLP, particularly Large Language Models (LLMs) such as OpenAI's GPT-4, offer promising solutions by providing human-like and context-aware responses based on extensive pre-trained knowledge. Motivated by the effectiveness of LLMs in various educational tasks (e.g., content creation and summarization, problem-solving, and automated feedback provision), our study introduces the Socratic Playground for Learning (SPL), a dialogue-based ITS powered by the GPT-4 model, which employs the Socratic teaching method to foster critical thinking among learners. Through extensive prompt engineering, SPL can generate specific learning scenarios and facilitates efficient multi-turn tutoring dialogues. The SPL system aims to enhance personalized and adaptive learning experiences tailored to individual needs, specifically focusing on improving critical thinking skills. Our pilot experimental results from essay writing tasks demonstrate SPL has the potential to improve tutoring interactions and further enhance dialogue-based ITS functionalities. Our study, exemplified by SPL, demonstrates how LLMs enhance dialogue-based ITSs and expand the accessibility and efficacy of educational technologies.
- [474] arXiv:2406.13920 [pdf, html, other]
-
Title: Explainable AI Security: Exploring Robustness of Graph Neural Networks to Adversarial AttacksSubjects: Machine Learning (cs.LG); Social and Information Networks (cs.SI)
Graph neural networks (GNNs) have achieved tremendous success, but recent studies have shown that GNNs are vulnerable to adversarial attacks, which significantly hinders their use in safety-critical scenarios. Therefore, the design of robust GNNs has attracted increasing attention. However, existing research has mainly been conducted via experimental trial and error, and thus far, there remains a lack of a comprehensive understanding of the vulnerability of GNNs. To address this limitation, we systematically investigate the adversarial robustness of GNNs by considering graph data patterns, model-specific factors, and the transferability of adversarial examples. Through extensive experiments, a set of principled guidelines is obtained for improving the adversarial robustness of GNNs, for example: (i) rather than highly regular graphs, the training graph data with diverse structural patterns is crucial for model robustness, which is consistent with the concept of adversarial training; (ii) the large model capacity of GNNs with sufficient training data has a positive effect on model robustness, and only a small percentage of neurons in GNNs are affected by adversarial attacks; (iii) adversarial transfer is not symmetric and the adversarial examples produced by the small-capacity model have stronger adversarial transferability. This work illuminates the vulnerabilities of GNNs and opens many promising avenues for designing robust GNNs.
- [475] arXiv:2406.13922 [pdf, html, other]
-
Title: Explicit Performance Bound of Finite Blocklength Coded MIMO: Time-Domain versus Spatiotemporal Channel CodingComments: 9 pages, 5 figuresSubjects: Information Theory (cs.IT)
In the sixth generation (6G), ultra-reliable low-latency communications (URLLC) will further develop to achieve TKu extreme connectivity, and multiple-input multiple-output (MIMO) is expected to be a key enabler for its realization. Since the latency constraint can be represented by the blocklength of a codeword, it is essential to analyze different coded MIMO schemes under finite blocklength regime. In this paper, we analyze the statistical characteristics of information density of time-domain coding and spatiotemporal coding MIMO, compute the channel capacity and dispersion, and present new explicit performance bounds of finite blocklength coded MIMO for different coding modes via normal approximation. As revealed by the analysis and simulation, spatiotemporal coding can effectively mitigate the performance loss induced by short blocklength by increasing the spatial degree of freedom (DoF). However, for time-domain coding, each spatial link is encoded independently, and the performance loss will be more severe with short blocklength under any spatial DoF. These results indicate that spatiotemporal coding can optimally exploit the spatial dimension advantages of MIMO systems compared with time-domain coding, and it has the potential to support URLLC transmission, enabling very low error-rate communication under stringent blocklength constraint.
- [476] arXiv:2406.13923 [pdf, html, other]
-
Title: PIN: A Knowledge-Intensive Dataset for Paired and Interleaved Multimodal DocumentsJunjie Wang, Yin Zhang, Yatai Ji, Yuxiang Zhang, Chunyang Jiang, Yubo Wang, Kang Zhu, Zekun Wang, Tiezhen Wang, Wenhao Huang, Jie Fu, Bei Chen, Qunshu Lin, Minghao Liu, Ge Zhang, Wenhu ChenSubjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Recent advancements in Large Multimodal Models (LMMs) have leveraged extensive multimodal datasets to enhance capabilities in complex knowledge-driven tasks. However, persistent challenges in perceptual and reasoning errors limit their efficacy, particularly in interpreting intricate visual data and deducing multimodal relationships. Addressing these issues, we introduce a novel dataset format, PIN (Paired and INterleaved multimodal documents), designed to significantly improve both the depth and breadth of multimodal training. The PIN format is built on three foundational principles: knowledge intensity, scalability, and support for diverse training modalities. This innovative format combines markdown files and comprehensive images to enrich training data with a dense knowledge structure and versatile training strategies. We present PIN-14M, an open-source dataset comprising 14 million samples derived from a diverse range of Chinese and English sources, tailored to include complex web and scientific content. This dataset is constructed meticulously to ensure data quality and ethical integrity, aiming to facilitate advanced training strategies and improve model robustness against common multimodal training pitfalls. Our initial results, forming the basis of this technical report, suggest significant potential for the PIN format in refining LMM performance, with plans for future expansions and detailed evaluations of its impact on model capabilities.
- [477] arXiv:2406.13925 [pdf, html, other]
-
Title: GenderAlign: An Alignment Dataset for Mitigating Gender Bias in Large Language ModelsSubjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Large Language Models (LLMs) are prone to generating content that exhibits gender biases, raising significant ethical concerns. Alignment, the process of fine-tuning LLMs to better align with desired behaviors, is recognized as an effective approach to mitigate gender biases. Although proprietary LLMs have made significant strides in mitigating gender bias, their alignment datasets are not publicly available. The commonly used and publicly available alignment dataset, HH-RLHF, still exhibits gender bias to some extent. There is a lack of publicly available alignment datasets specifically designed to address gender bias. Hence, we developed a new dataset named GenderAlign, aiming at mitigating a comprehensive set of gender biases in LLMs. This dataset comprises 8k single-turn dialogues, each paired with a "chosen" and a "rejected" response. Compared to the "rejected" responses, the "chosen" responses demonstrate lower levels of gender bias and higher quality. Furthermore, we categorized the gender biases in the "rejected" responses of GenderAlign into 4 principal categories. The experimental results show the effectiveness of GenderAlign in reducing gender bias in LLMs.
- [478] arXiv:2406.13928 [pdf, html, other]
-
Title: Optimal deep learning of holomorphic operators between Banach spacesSubjects: Machine Learning (cs.LG); Numerical Analysis (math.NA)
Operator learning problems arise in many key areas of scientific computing where Partial Differential Equations (PDEs) are used to model physical systems. In such scenarios, the operators map between Banach or Hilbert spaces. In this work, we tackle the problem of learning operators between Banach spaces, in contrast to the vast majority of past works considering only Hilbert spaces. We focus on learning holomorphic operators - an important class of problems with many applications. We combine arbitrary approximate encoders and decoders with standard feedforward Deep Neural Network (DNN) architectures - specifically, those with constant width exceeding the depth - under standard $\ell^2$-loss minimization. We first identify a family of DNNs such that the resulting Deep Learning (DL) procedure achieves optimal generalization bounds for such operators. For standard fully-connected architectures, we then show that there are uncountably many minimizers of the training problem that yield equivalent optimal performance. The DNN architectures we consider are `problem agnostic', with width and depth only depending on the amount of training data $m$ and not on regularity assumptions of the target operator. Next, we show that DL is optimal for this problem: no recovery procedure can surpass these generalization bounds up to log terms. Finally, we present numerical results demonstrating the practical performance on challenging problems including the parametric diffusion, Navier-Stokes-Brinkman and Boussinesq PDEs.
- [479] arXiv:2406.13929 [pdf, html, other]
-
Title: Large Language Models are Skeptics: False Negative Problem of Input-conflicting HallucinationComments: 12 pages, 9 figuresSubjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
In this paper, we identify a new category of bias that induces input-conflicting hallucinations, where large language models (LLMs) generate responses inconsistent with the content of the input context. This issue we have termed the false negative problem refers to the phenomenon where LLMs are predisposed to return negative judgments when assessing the correctness of a statement given the context. In experiments involving pairs of statements that contain the same information but have contradictory factual directions, we observe that LLMs exhibit a bias toward false negatives. Specifically, the model presents greater overconfidence when responding with False. Furthermore, we analyze the relationship between the false negative problem and context and query rewriting and observe that both effectively tackle false negatives in LLMs.
- [480] arXiv:2406.13930 [pdf, html, other]
-
Title: Soft-QMIX: Integrating Maximum Entropy For Monotonic Value Function FactorizationSubjects: Machine Learning (cs.LG)
Multi-agent reinforcement learning (MARL) tasks often utilize a centralized training with decentralized execution (CTDE) framework. QMIX is a successful CTDE method that learns a credit assignment function to derive local value functions from a global value function, defining a deterministic local policy. However, QMIX is hindered by its poor exploration strategy. While maximum entropy reinforcement learning (RL) promotes better exploration through stochastic policies, QMIX's process of credit assignment conflicts with the maximum entropy objective and the decentralized execution requirement, making it unsuitable for maximum entropy RL. In this paper, we propose an enhancement to QMIX by incorporating an additional local Q-value learning method within the maximum entropy RL framework. Our approach constrains the local Q-value estimates to maintain the correct ordering of all actions. Due to the monotonicity of the QMIX value function, these updates ensure that locally optimal actions align with globally optimal actions. We theoretically prove the monotonic improvement and convergence of our method to an optimal solution. Experimentally, we validate our algorithm in matrix games, Multi-Agent Particle Environment and demonstrate state-of-the-art performance in SMAC-v2.
- [481] arXiv:2406.13931 [pdf, html, other]
-
Title: Dissipativeness of the hyperbolic quadrature method of moments for kinetic equationsComments: 23 pagesSubjects: Numerical Analysis (math.NA)
This paper presents a dissipativeness analysis of a quadrature method of moments (called HyQMOM) for the one-dimensional BGK equation. The method has exhibited its good performance in numerous applications. However, its mathematical foundation has not been clarified. Here we present an analytical proof of the strict hyperbolicity of the HyQMOM-induced moment closure systems by introducing a polynomial-based closure technique. As a byproduct, a class of numerical schemes for the HyQMOM system is shown to be realizability preserving under CFL-type conditions. We also show that the system preserves the dissipative properties of the kinetic equation by verifying a certain structural stability condition. The proof uses a newly introduced affine invariance and the homogeneity of the HyQMOM and heavily relies on the theory of orthogonal polynomials associated with realizable moments, in particular, the moments of the standard normal distribution.
- [482] arXiv:2406.13933 [pdf, html, other]
-
Title: EnTruth: Enhancing the Traceability of Unauthorized Dataset Usage in Text-to-image Diffusion Models with Minimal and Robust AlterationsSubjects: Cryptography and Security (cs.CR)
Generative models, especially text-to-image diffusion models, have significantly advanced in their ability to generate images, benefiting from enhanced architectures, increased computational power, and large-scale datasets. While the datasets play an important role, their protection has remained as an unsolved issue. Current protection strategies, such as watermarks and membership inference, are either in high poison rate which is detrimental to image quality or suffer from low accuracy and robustness. In this work, we introduce a novel approach, EnTruth, which Enhances Traceability of unauthorized dataset usage utilizing template memorization. By strategically incorporating the template memorization, EnTruth can trigger the specific behavior in unauthorized models as the evidence of infringement. Our method is the first to investigate the positive application of memorization and use it for copyright protection, which turns a curse into a blessing and offers a pioneering perspective for unauthorized usage detection in generative models. Comprehensive experiments are provided to demonstrate its effectiveness in terms of data-alteration rate, accuracy, robustness and generation quality.
- [483] arXiv:2406.13934 [pdf, html, other]
-
Title: Reasoning Like a Doctor: Improving Medical Dialogue Systems via Diagnostic Reasoning Process AlignmentComments: Accepted by ACL 2024 FindingsSubjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Medical dialogue systems have attracted significant attention for their potential to act as medical assistants. Enabling these medical systems to emulate clinicians' diagnostic reasoning process has been the long-standing research focus. Previous studies rudimentarily realized the simulation of clinicians' diagnostic process by fine-tuning language models on high-quality dialogue datasets. Nonetheless, they overly focus on the outcomes of the clinician's reasoning process while ignoring their internal thought processes and alignment with clinician preferences. Our work aims to build a medical dialogue system that aligns with clinicians' diagnostic reasoning processes. We propose a novel framework, Emulation, designed to generate an appropriate response that relies on abductive and deductive diagnostic reasoning analyses and aligns with clinician preferences through thought process modeling. Experimental results on two datasets confirm the efficacy of Emulation. Crucially, our framework furnishes clear explanations for the generated responses, enhancing its transparency in medical consultations.
- [484] arXiv:2406.13939 [pdf, html, other]
-
Title: 2nd Place Solution for MeViS Track in CVPR 2024 PVUW Workshop: Motion Expression guided Video SegmentationSubjects: Computer Vision and Pattern Recognition (cs.CV)
Motion Expression guided Video Segmentation is a challenging task that aims at segmenting objects in the video based on natural language expressions with motion descriptions. Unlike the previous referring video object segmentation (RVOS), this task focuses more on the motion in video content for language-guided video object segmentation, requiring an enhanced ability to model longer temporal, motion-oriented vision-language data. In this report, based on the RVOS methods, we successfully introduce mask information obtained from the video instance segmentation model as preliminary information for temporal enhancement and employ SAM for spatial refinement. Finally, our method achieved a score of 49.92 J &F in the validation phase and 54.20 J &F in the test phase, securing the final ranking of 2nd in the MeViS Track at the CVPR 2024 PVUW Challenge.
- [485] arXiv:2406.13940 [pdf, html, other]
-
Title: AutoCAP: Towards Automatic Cross-lingual Alignment Planning for Zero-shot Chain-of-ThoughtComments: Accepted by ACL2024 FindingsSubjects: Computation and Language (cs.CL)
Cross-lingual chain-of-thought can effectively complete reasoning tasks across languages, which gains increasing attention. Recently, dominant approaches in the literature improve cross-lingual alignment capabilities by integrating reasoning knowledge from different languages. Despite achieving excellent performance, current methods still have two main challenges: (1) Manual language specification: They still highly rely on manually selecting the languages to integrate, severely affecting their generalizability; (2) Static weight allocation: Current methods simply integrate all languages equally. In fact, different language reasoning paths should have different weights to achieve better complementation and integration. Motivated by this, we introduce an Automatic Cross-lingual Alignment Planning (AutoCAP) for zero-shot chain-of-thought to address the above challenges. The core of AutoCAP consists of two components: (1) Automatic Language Selection Prompting to guide LLMs to select appropriate languages and (2) Automatic Weight Allocation Prompting to automatically allocate alignment weight scores to each reasoning path. Extensive experiments on several benchmarks reveal that AutoCAP achieves state-of-the-art performance, surpassing previous methods that required manual effort.
- [486] arXiv:2406.13941 [pdf, html, other]
-
Title: UpDLRM: Accelerating Personalized Recommendation using Real-World PIM ArchitectureSubjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Deep Learning Recommendation Models (DLRMs) have gained popularity in recommendation systems due to their effectiveness in handling large-scale recommendation tasks. The embedding layers of DLRMs have become the performance bottleneck due to their intensive needs on memory capacity and memory bandwidth. In this paper, we propose UpDLRM, which utilizes real-world processingin-memory (PIM) hardware, UPMEM DPU, to boost the memory bandwidth and reduce recommendation latency. The parallel nature of the DPU memory can provide high aggregated bandwidth for the large number of irregular memory accesses in embedding lookups, thus offering great potential to reduce the inference latency. To fully utilize the DPU memory bandwidth, we further studied the embedding table partitioning problem to achieve good workload-balance and efficient data caching. Evaluations using real-world datasets show that, UpDLRM achieves much lower inference time for DLRM compared to both CPU-only and CPU-GPU hybrid counterparts.
- [487] arXiv:2406.13942 [pdf, html, other]
-
Title: Synthesizing Multimodal Electronic Health Records via Predictive Diffusion ModelsYuan Zhong, Xiaochen Wang, Jiaqi Wang, Xiaokun Zhang, Yaqing Wang, Mengdi Huai, Cao Xiao, Fenglong MaSubjects: Machine Learning (cs.LG)
Synthesizing electronic health records (EHR) data has become a preferred strategy to address data scarcity, improve data quality, and model fairness in healthcare. However, existing approaches for EHR data generation predominantly rely on state-of-the-art generative techniques like generative adversarial networks, variational autoencoders, and language models. These methods typically replicate input visits, resulting in inadequate modeling of temporal dependencies between visits and overlooking the generation of time information, a crucial element in EHR data. Moreover, their ability to learn visit representations is limited due to simple linear mapping functions, thus compromising generation quality. To address these limitations, we propose a novel EHR data generation model called EHRPD. It is a diffusion-based model designed to predict the next visit based on the current one while also incorporating time interval estimation. To enhance generation quality and diversity, we introduce a novel time-aware visit embedding module and a pioneering predictive denoising diffusion probabilistic model (PDDPM). Additionally, we devise a predictive U-Net (PU-Net) to optimize P-DDPM.We conduct experiments on two public datasets and evaluate EHRPD from fidelity, privacy, and utility perspectives. The experimental results demonstrate the efficacy and utility of the proposed EHRPD in addressing the aforementioned limitations and advancing EHR data generation.
- [488] arXiv:2406.13943 [pdf, html, other]
-
Title: New QEC codes and EAQEC codes from repeated-root cyclic codes of length $2^rp^s$Subjects: Information Theory (cs.IT)
Let $p$ be an odd prime and $r,s,m$ be positive integers. In this study, we initiate our exploration by delving into the intricate structure of all repeated-root cyclic codes and their duals with a length of $2^rp^s$ over the finite field $\mathbb{F}_{p^m}$. Through the utilization of CSS and Steane's constructions, a series of new quantum error-correcting (QEC) codes are constructed with parameters distinct from all previous constructions. Furthermore, we provide all maximum distance separable (MDS) cyclic codes of length $2^rp^s$, which are further utilized in the construction of QEC MDS codes. Finally, we introduce a significant number of novel entanglement-assisted quantum error-correcting (EAQEC) codes derived from these repeated-root cyclic codes. Notably, these newly constructed codes exhibit parameters distinct from those of previously known constructions.
- [489] arXiv:2406.13945 [pdf, html, other]
-
Title: CityBench: Evaluating the Capabilities of Large Language Model as World ModelJie Feng, Jun Zhang, Junbo Yan, Xin Zhang, Tianjian Ouyang, Tianhui Liu, Yuwei Du, Siqi Guo, Yong LiSubjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Large language models (LLMs) with powerful generalization ability has been widely used in many domains. A systematic and reliable evaluation of LLMs is a crucial step in their development and applications, especially for specific professional fields. In the urban domain, there have been some early explorations about the usability of LLMs, but a systematic and scalable evaluation benchmark is still lacking. The challenge in constructing a systematic evaluation benchmark for the urban domain lies in the diversity of data and scenarios, as well as the complex and dynamic nature of cities. In this paper, we propose CityBench, an interactive simulator based evaluation platform, as the first systematic evaluation benchmark for the capability of LLMs for urban domain. First, we build CitySim to integrate the multi-source data and simulate fine-grained urban dynamics. Based on CitySim, we design 7 tasks in 2 categories of perception-understanding and decision-making group to evaluate the capability of LLMs as city-scale world model for urban domain. Due to the flexibility and ease-of-use of CitySim, our evaluation platform CityBench can be easily extended to any city in the world. We evaluate 13 well-known LLMs including open source LLMs and commercial LLMs in 13 cities around the world. Extensive experiments demonstrate the scalability and effectiveness of proposed CityBench and shed lights for the future development of LLMs in urban domain. The dataset, benchmark and source codes are openly accessible to the research community via this https URL
- [490] arXiv:2406.13947 [pdf, html, other]
-
Title: AspirinSum: an Aspect-based utility-preserved de-identification Summarization frameworkSubjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Due to the rapid advancement of Large Language Model (LLM), the whole community eagerly consumes any available text data in order to train the LLM. Currently, large portion of the available text data are collected from internet, which has been thought as a cheap source of the training data. However, when people try to extend the LLM's capability to the personal related domain, such as healthcare or education, the lack of public dataset in these domains make the adaption of the LLM in such domains much slower. The reason of lacking public available dataset in such domains is because they usually contain personal sensitive information. In order to comply with privacy law, the data in such domains need to be de-identified before any kind of dissemination. It had been much research tried to address this problem for the image or tabular data. However, there was limited research on the efficient and general de-identification method for text data. Most of the method based on human annotation or predefined category list. It usually can not be easily adapted to specific domains. The goal of this proposal is to develop a text de-identification framework, which can be easily adapted to the specific domain, leverage the existing expert knowledge without further human annotation. We propose an aspect-based utility-preserved de-identification summarization framework, AspirinSum, by learning to align expert's aspect from existing comment data, it can efficiently summarize the personal sensitive document by extracting personal sensitive aspect related sub-sentence and de-identify it by substituting it with similar aspect sub-sentence. We envision that the de-identified text can then be used in data publishing, eventually publishing our de-identified dataset for downstream task use.
- [491] arXiv:2406.13948 [pdf, other]
-
Title: CityGPT: Empowering Urban Spatial Cognition of Large Language ModelsSubjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Large language models(LLMs) with powerful language generation and reasoning capabilities have already achieved success in many domains, e.g., math and code generation. However, due to the lacking of physical world's corpus and knowledge during training, they usually fail to solve many real-life tasks in the urban space. In this paper, we propose CityGPT, a systematic framework for enhancing the capability of LLMs on understanding urban space and solving the related urban tasks by building a city-scale world model in the model. First, we construct a diverse instruction tuning dataset CityInstruction for injecting urban knowledge and enhancing spatial reasoning capability effectively. By using a mixture of CityInstruction and general instruction data, we fine-tune various LLMs (e.g., ChatGLM3-6B, Qwen1.5 and LLama3 series) to enhance their capability without sacrificing general abilities. To further validate the effectiveness of proposed methods, we construct a comprehensive benchmark CityEval to evaluate the capability of LLMs on diverse urban scenarios and problems. Extensive evaluation results demonstrate that small LLMs trained with CityInstruction can achieve competitive performance with commercial LLMs in the comprehensive evaluation of CityEval. The source codes are openly accessible to the research community via this https URL.
- [492] arXiv:2406.13951 [pdf, html, other]
-
Title: Towards the in-situ Trunk Identification and Length Measurement of Sea Cucumbers via B\'{e}zier Curve ModellingSubjects: Computer Vision and Pattern Recognition (cs.CV)
We introduce a novel vision-based framework for in-situ trunk identification and length measurement of sea cucumbers, which plays a crucial role in the monitoring of marine ranching resources and mechanized harvesting. To model sea cucumber trunk curves with varying degrees of bending, we utilize the parametric Bézier curve due to its computational simplicity, stability, and extensive range of transformation possibilities. Then, we propose an end-to-end unified framework that combines parametric Bézier curve modeling with the widely used You-Only-Look-Once (YOLO) pipeline, abbreviated as TISC-Net, and incorporates effective funnel activation and efficient multi-scale attention modules to enhance curve feature perception and learning. Furthermore, we propose incorporating trunk endpoint loss as an additional constraint to effectively mitigate the impact of endpoint deviations on the overall curve. Finally, by utilizing the depth information of pixels located along the trunk curve captured by a binocular camera, we propose accurately estimating the in-situ length of sea cucumbers through space curve integration. We established two challenging benchmark datasets for curve-based in-situ sea cucumber trunk identification. These datasets consist of over 1,000 real-world marine environment images of sea cucumbers, accompanied by Bézier format annotations. We conduct evaluation on SC-ISTI, for which our method achieves mAP50 above 0.9 on both object detection and trunk identification tasks. Extensive length measurement experiments demonstrate that the average absolute relative error is around 0.15.
- [493] arXiv:2406.13954 [pdf, other]
-
Title: Research on Flight Accidents Prediction based Back Propagation Neural NetworkSubjects: Artificial Intelligence (cs.AI)
With the rapid development of civil aviation and the significant improvement of people's living standards, taking an air plane has become a common and efficient way of travel. However, due to the flight characteris-tics of the aircraft and the sophistication of the fuselage structure, flight de-lays and flight accidents occur from time to time. In addition, the life risk factor brought by aircraft after an accident is also the highest among all means of transportation. In this work, a model based on back-propagation neural network was used to predict flight accidents. By collecting historical flight data, including a variety of factors such as meteorological conditions, aircraft technical condition, and pilot experience, we trained a backpropaga-tion neural network model to identify potential accident risks. In the model design, a multi-layer perceptron structure is used to optimize the network performance by adjusting the number of hidden layer nodes and the learning rate. Experimental analysis shows that the model can effectively predict flight accidents with high accuracy and reliability.
- [494] arXiv:2406.13960 [pdf, html, other]
-
Title: Evolving to be Your Soulmate: Personalized Dialogue Agents with Dynamically Adapted PersonasComments: Work in progressSubjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Previous research on persona-based dialogue agents typically preset the agent's persona before deployment, which remains static thereafter. In this paper, we take a step further and explore a new paradigm called Self-evolving Personalized Dialogue Agents (SPDA), where the agent continuously evolves during the conversation to better align with the user's anticipation by dynamically adapting its persona. This paradigm could enable better personalization for each user, but also introduce unique challenges, which mainly lie in the process of persona adaptation. Two key issues include how to achieve persona alignment with the user and how to ensure smooth transition in the adaptation process. To address them, we propose a novel framework that refines the persona at hierarchical levels to progressively align better with the user in a controllable way. Experiments show that integrating the personas adapted by our framework consistently enhances personalization and overall dialogue performance across various base systems.
- [495] arXiv:2406.13961 [pdf, html, other]
-
Title: Equivariant Offline Reinforcement LearningSubjects: Machine Learning (cs.LG); Robotics (cs.RO)
Sample efficiency is critical when applying learning-based methods to robotic manipulation due to the high cost of collecting expert demonstrations and the challenges of on-robot policy learning through online Reinforcement Learning (RL). Offline RL addresses this issue by enabling policy learning from an offline dataset collected using any behavioral policy, regardless of its quality. However, recent advancements in offline RL have predominantly focused on learning from large datasets. Given that many robotic manipulation tasks can be formulated as rotation-symmetric problems, we investigate the use of $SO(2)$-equivariant neural networks for offline RL with a limited number of demonstrations. Our experimental results show that equivariant versions of Conservative Q-Learning (CQL) and Implicit Q-Learning (IQL) outperform their non-equivariant counterparts. We provide empirical evidence demonstrating how equivariance improves offline learning algorithms in the low-data regime.
- [496] arXiv:2406.13963 [pdf, html, other]
-
Title: SSAD: Self-supervised Auxiliary Detection Framework for Panoramic X-ray based Dental Disease DiagnosisZijian Cai, Xinquan Yang, Xuguang Li, Xiaoling Luo, Xuechen Li, Linlin Shen, He Meng, Yongqiang DengSubjects: Computer Vision and Pattern Recognition (cs.CV)
Panoramic X-ray is a simple and effective tool for diagnosing dental diseases in clinical practice. When deep learning models are developed to assist dentist in interpreting panoramic X-rays, most of their performance suffers from the limited annotated data, which requires dentist's expertise and a lot of time cost. Although self-supervised learning (SSL) has been proposed to address this challenge, the two-stage process of pretraining and fine-tuning requires even more training time and computational resources. In this paper, we present a self-supervised auxiliary detection (SSAD) framework, which is plug-and-play and compatible with any detectors. It consists of a reconstruction branch and a detection branch. Both branches are trained simultaneously, sharing the same encoder, without the need for finetuning. The reconstruction branch learns to restore the tooth texture of healthy or diseased teeth, while the detection branch utilizes these learned features for diagnosis. To enhance the encoder's ability to capture fine-grained features, we incorporate the image encoder of SAM to construct a texture consistency (TC) loss, which extracts image embedding from the input and output of reconstruction branch, and then enforces both embedding into the same feature space. Extensive experiments on the public DENTEX dataset through three detection tasks demonstrate that the proposed SSAD framework achieves state-of-the-art performance compared to mainstream object detection methods and SSL methods. The code is available at this https URL
- [497] arXiv:2406.13964 [pdf, html, other]
-
Title: Hierarchical Micro-Segmentations for Zero-Trust Services via Large Language Model (LLM)-enhanced Graph DiffusionYinqiu Liu, Guangyuan Liu, Hongyang Du, Dusit Niyato, Jiawen Kang, Zehui Xiong, Dong In Kim, Xuemin ShenComments: 13 pagesSubjects: Networking and Internet Architecture (cs.NI)
In the rapidly evolving Next-Generation Networking (NGN) era, the adoption of zero-trust architectures has become increasingly crucial to protect security. However, provisioning zero-trust services in NGNs poses significant challenges, primarily due to the environmental complexity and dynamics. Motivated by these challenges, this paper explores efficient zero-trust service provisioning using hierarchical micro-segmentations. Specifically, we model zero-trust networks via hierarchical graphs, thereby jointly considering the resource- and trust-level features to optimize service efficiency. We organize such zero-trust networks through micro-segmentations, which support granular zero-trust policies efficiently. To generate the optimal micro-segmentation, we present the Large Language Model-Enhanced Graph Diffusion (LEGD) algorithm, which leverages the diffusion process to realize a high-quality generation paradigm. Additionally, we utilize policy boosting and Large Language Models (LLM) to enable LEGD to optimize the generation policy and understand complicated graphical features. Moreover, realizing the unique trustworthiness updates or service upgrades in zero-trust NGN, we further present LEGD-Adaptive Maintenance (LEGD-AM), providing an adaptive way to perform task-oriented fine-tuning on LEGD. Extensive experiments demonstrate that the proposed LEGD achieves 90% higher efficiency in provisioning services compared with other baselines. Moreover, the LEGD-AM can reduce the service outage time by over 50%.
- [498] arXiv:2406.13966 [pdf, html, other]
-
Title: Causal Inference with Latent Variables: Recent Advances and Future ProspectivesComments: Accepted by KDD'24 Survey TrackSubjects: Machine Learning (cs.LG); Methodology (stat.ME)
Causality lays the foundation for the trajectory of our world. Causal inference (CI), which aims to infer intrinsic causal relations among variables of interest, has emerged as a crucial research topic. Nevertheless, the lack of observation of important variables (e.g., confounders, mediators, exogenous variables, etc.) severely compromises the reliability of CI methods. The issue may arise from the inherent difficulty in measuring the variables. Additionally, in observational studies where variables are passively recorded, certain covariates might be inadvertently omitted by the experimenter. Depending on the type of unobserved variables and the specific CI task, various consequences can be incurred if these latent variables are carelessly handled, such as biased estimation of causal effects, incomplete understanding of causal mechanisms, lack of individual-level causal consideration, etc. In this survey, we provide a comprehensive review of recent developments in CI with latent variables. We start by discussing traditional CI techniques when variables of interest are assumed to be fully observed. Afterward, under the taxonomy of circumvention and inference-based methods, we provide an in-depth discussion of various CI strategies to handle latent variables, covering the tasks of causal effect estimation, mediation analysis, counterfactual reasoning, and causal discovery. Furthermore, we generalize the discussion to graph data where interference among units may exist. Finally, we offer fresh aspects for further advancement of CI with latent variables, especially new opportunities in the era of large language models (LLMs).
- [499] arXiv:2406.13968 [pdf, html, other]
-
Title: Recent Advances in Traffic Accident Analysis and Prediction: A Comprehensive Review of Machine Learning TechniquesComments: A review paper, 26 pagesSubjects: Machine Learning (cs.LG)
Traffic accidents pose a severe global public health issue, leading to 1.19 million fatalities annually, with the greatest impact on individuals aged 5 to 29 years old. This paper addresses the critical need for advanced predictive methods in road safety by conducting a comprehensive review of recent advancements in applying machine learning (ML) techniques to traffic accident analysis and prediction. It examines 191 studies from the last five years, focusing on predicting accident risk, frequency, severity, duration, as well as general statistical analysis of accident data. To our knowledge, this study is the first to provide such a comprehensive review, covering the state-of-the-art across a wide range of domains related to accident analysis and prediction. The review highlights the effectiveness of integrating diverse data sources and advanced ML techniques to improve prediction accuracy and handle the complexities of traffic data. By mapping the current landscape and identifying gaps in the literature, this study aims to guide future research towards significantly reducing traffic-related deaths and injuries by 2030, aligning with the World Health Organization (WHO) targets.
- [500] arXiv:2406.13971 [pdf, html, other]
-
Title: Complex fractal trainability boundary can arise from trivial non-convexityComments: 11 pages, 9 figures, preliminary testsSubjects: Machine Learning (cs.LG); Dynamical Systems (math.DS); Chaotic Dynamics (nlin.CD)
Training neural networks involves optimizing parameters to minimize a loss function, where the nature of the loss function and the optimization strategy are crucial for effective training. Hyperparameter choices, such as the learning rate in gradient descent (GD), significantly affect the success and speed of convergence. Recent studies indicate that the boundary between bounded and divergent hyperparameters can be fractal, complicating reliable hyperparameter selection. However, the nature of this fractal boundary and methods to avoid it remain unclear. In this study, we focus on GD to investigate the loss landscape properties that might lead to fractal trainability boundaries. We discovered that fractal boundaries can emerge from simple non-convex perturbations, i.e., adding or multiplying cosine type perturbations to quadratic functions. The observed fractal dimensions are influenced by factors like parameter dimension, type of non-convexity, perturbation wavelength, and perturbation amplitude. Our analysis identifies "roughness of perturbation", which measures the gradient's sensitivity to parameter changes, as the factor controlling fractal dimensions of trainability boundaries. We observed a clear transition from non-fractal to fractal trainability boundaries as roughness increases, with the critical roughness causing the perturbed loss function non-convex. Thus, we conclude that fractal trainability boundaries can arise from very simple non-convexity. We anticipate that our findings will enhance the understanding of complex behaviors during neural network training, leading to more consistent and predictable training strategies.
- [501] arXiv:2406.13972 [pdf, html, other]
-
Title: CREF: An LLM-based Conversational Software Repair Framework for Programming TutorsBoyang Yang, Haoye Tian, Weiguo Pian, Haoran Yu, Haitao Wang, Jacques Klein, Tegawendé F. Bissyandé, Shunfu JinSubjects: Software Engineering (cs.SE)
Program repair techniques offer cost-saving benefits for debugging within software development and programming education scenarios. With the proven effectiveness of Large Language Models (LLMs) in code-related tasks, researchers have explored their potential for program repair. However, it is crucial to recognize that existing repair benchmarks may have influenced LLM training data, potentially causing data leakage. To evaluate LLMs' realistic repair capabilities, (1) we introduce an extensive, non-crawled benchmark, referred to as TutorCode, comprising 1,239 C++ defect codes and associated information such as tutor guidance, solution description, failing test cases, and the corrected code. Our work assesses the repair performance of 12 LLMs on TutorCode, measuring repair correctness (TOP-5 and AVG-5) and patch precision (RPSR). (2) We then provide a comprehensive investigation into which types of extra information can help LLMs improve their performance in repairing defects. Among these types, tutor guidance was found to be the most effective information in enhancing LLM repair capabilities. To fully harness LLMs' conversational capabilities and the benefits of augmented information, (3) we introduce a novel conversational semi-automatic repair framework CREF assisting human tutor. It demonstrates a remarkable AVG-5 improvement of 17.2%-24.6% compared to the baseline, achieving an impressive AVG-5 of 76.6% when utilizing GPT-4. These results highlight the potential for enhancing LLMs' repair capabilities through interactions with tutors and historical conversations involving incorrect responses. The successful application of CREF in a real-world educational setting demonstrates its effectiveness in reducing tutors' workload and improving students' learning experience, while also showcasing its promise for facilitating other software engineering tasks, such as code review.
- [502] arXiv:2406.13975 [pdf, html, other]
-
Title: MR-BEN: A Comprehensive Meta-Reasoning Benchmark for Large Language ModelsZhongshen Zeng, Yinhong Liu, Yingjia Wan, Jingyao Li, Pengguang Chen, Jianbo Dai, Yuxuan Yao, Rongwu Xu, Zehan Qi, Wanru Zhao, Linling Shen, Jianqiao Lu, Haochen Tan, Yukang Chen, Hao Zhang, Zhan Shi, Bailin Wang, Zhijiang Guo, Jiaya JiaSubjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Large language models (LLMs) have shown increasing capability in problem-solving and decision-making, largely based on the step-by-step chain-of-thought reasoning processes. However, it has been increasingly challenging to evaluate the reasoning capability of LLMs. Concretely, existing outcome-based benchmarks begin to saturate and become less sufficient to monitor the progress. To this end, we present a process-based benchmark MR-BEN that demands a meta reasoning skill, where LMs are asked to locate and analyse potential errors in automatically generated reasoning steps. MR-BEN is a comprehensive benchmark comprising 5,975 questions collected from human experts, covering various subjects such as physics, chemistry, logic, coding, and more. Through our designed metrics for assessing meta-reasoning on this benchmark, we identify interesting limitations and weaknesses of current LLMs (open-source and closed-source models). For example, open-source models are seemingly comparable to GPT-4 on outcome-based benchmarks, but they lag far behind on our benchmark, revealing the underlying reasoning capability gap between them. Our dataset and codes are available on this https URL.
- [503] arXiv:2406.13982 [pdf, html, other]
-
Title: Improved Remixing Process for Domain Adaptation-Based Speech Enhancement by Mitigating Data Imbalance in Signal-to-Noise RatioComments: Accepted at Interspeech2024Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
RemixIT and Remixed2Remixed are domain adaptation-based speech enhancement (DASE) methods that use a teacher model trained in full supervision to generate pseudo-paired data by remixing the outputs of the teacher model. The student model for enhancing real-world recorded signals is trained using the pseudo-paired data without ground truth. Since the noisy signals are recorded in natural environments, the dataset inevitably suffers data imbalance in some acoustic properties, leading to subpar performance for the underrepresented data. The signal-to-noise ratio (SNR), inherently balanced in supervised learning, is a prime example. In this paper, we provide empirical evidence that the SNR of pseudo data has a significant impact on model performance using the dataset of the CHiME-7 UDASE task, highlighting the importance of balanced SNR in DASE. Furthermore, we propose adopting curriculum learning to encompass a broad range of SNRs to boost performance for underrepresented data.
- [504] arXiv:2406.13983 [pdf, html, other]
-
Title: Barter Exchange with Shared Item ValuationsComments: A previous version of this work appeared in the proceedings of WWW '24Subjects: Data Structures and Algorithms (cs.DS)
In barter exchanges agents enter seeking to swap their items for other items on their wishlist. We consider a centralized barter exchange with a set of agents and items where each item has a positive value. The goal is to compute a (re)allocation of items maximizing the agents' collective utility subject to each agent's total received value being comparable to their total given value. Many such centralized barter exchanges exist and serve crucial roles; e.g., kidney exchange programs, which are often formulated as variants of directed cycle packing. We show finding a reallocation where each agent's total given and total received values are equal is NP-hard. On the other hand, we develop a randomized algorithm that achieves optimal utility in expectation and where, i) for any agent, with probability 1 their received value is at least their given value minus $v^*$ where $v^*$ is said agent's most valuable owned and wished-for item, and ii) each agent's given and received values are equal in expectation.
- [505] arXiv:2406.13984 [pdf, html, other]
-
Title: Reducing Memory Contention and I/O Congestion for Disk-based GNN TrainingComments: This is a full version for the paper with almost the same title accepted by the 53rd International Conference on Parallel Processing (ICPP 2024)Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
Graph neural networks (GNNs) gain wide popularity. Large graphs with high-dimensional features become common and training GNNs on them is non-trivial on an ordinary machine. Given a gigantic graph, even sample-based GNN training cannot work efficiently, since it is difficult to keep the graph's entire data in memory during the training process. Leveraging a solid-state drive (SSD) or other storage devices to extend the memory space has been studied in training GNNs. Memory and I/Os are hence critical for effectual disk-based training. We find that state-of-the-art (SoTA) disk-based GNN training systems severely suffer from issues like the memory contention between a graph's topological and feature data, and severe I/O congestion upon loading data from SSD for training. We accordingly develop GNNDrive. GNNDrive 1) minimizes the memory footprint with holistic buffer management across sampling and extracting, and 2) avoids I/O congestion through a strategy of asynchronous feature extraction. It also avoids costly data preparation on the critical path and makes the most of software and hardware resources. Experiments show that GNNDrive achieves superior performance. For example, when training with the Papers100M dataset and GraphSAGE model, GNNDrive is faster than SoTA PyG+, Ginex, and MariusGNN by 16.9x, 2.6x, and 2.7x, respectively.
- [506] arXiv:2406.13985 [pdf, html, other]
-
Title: The Elusive Pursuit of Replicating PATE-GAN: Benchmarking, Auditing, DebuggingSubjects: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
Synthetic data created by differentially private (DP) generative models is increasingly used in real-world settings. In this context, PATE-GAN has emerged as a popular algorithm, combining Generative Adversarial Networks (GANs) with the private training approach of PATE (Private Aggregation of Teacher Ensembles). In this paper, we analyze and benchmark six open-source PATE-GAN implementations, including three by (a subset of) the original authors. First, we shed light on architecture deviations and empirically demonstrate that none replicate the utility performance reported in the original paper. Then, we present an in-depth privacy evaluation, including DP auditing, showing that all implementations leak more privacy than intended and uncovering 17 privacy violations and 5 other bugs. Our codebase is available from this https URL.
- [507] arXiv:2406.13987 [pdf, other]
-
Title: Image anomaly detection and prediction scheme based on SSA optimized ResNet50-BiGRU modelSubjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Image anomaly detection is a popular research direction, with many methods emerging in recent years due to rapid advancements in computing. The use of artificial intelligence for image anomaly detection has been widely studied. By analyzing images of athlete posture and movement, it is possible to predict injury status and suggest necessary adjustments. Most existing methods rely on convolutional networks to extract information from irrelevant pixel data, limiting model accuracy. This paper introduces a network combining Residual Network (ResNet) and Bidirectional Gated Recurrent Unit (BiGRU), which can predict potential injury types and provide early warnings by analyzing changes in muscle and bone poses from video images. To address the high complexity of this network, the Sparrow search algorithm was used for optimization. Experiments conducted on four datasets demonstrated that our model has the smallest error in image anomaly detection compared to other models, showing strong adaptability. This provides a new approach for anomaly detection and predictive analysis in images, contributing to the sustainable development of human health and performance.
- [508] arXiv:2406.13988 [pdf, html, other]
-
Title: LGmap: Local-to-Global Mapping Network for Online Long-Range Vectorized HD Map ConstructionSubjects: Computer Vision and Pattern Recognition (cs.CV)
This report introduces the first-place winning solution for the Autonomous Grand Challenge 2024 - Mapless Driving. In this report, we introduce a novel online mapping pipeline LGmap, which adept at long-range temporal model. Firstly, we propose symmetric view transformation(SVT), a hybrid view transformation module. Our approach overcomes the limitations of forward sparse feature representation and utilizing depth perception and SD prior information. Secondly, we propose hierarchical temporal fusion(HTF) module. It employs temporal information from local to global, which empowers the construction of long-range HD map with high stability. Lastly, we propose a novel ped-crossing resampling. The simplified ped crossing representation accelerates the instance attention based decoder convergence performance. Our method achieves 0.66 UniScore in the Mapless Driving OpenLaneV2 test set.
- [509] arXiv:2406.13990 [pdf, html, other]
-
Title: Inference-Time Decontamination: Reusing Leaked Benchmarks for Large Language Model EvaluationSubjects: Computation and Language (cs.CL)
The training process of large language models (LLMs) often involves varying degrees of test data contamination. Although current LLMs are achieving increasingly better performance on various benchmarks, their performance in practical applications does not always match their benchmark results. Leakage of benchmarks can prevent the accurate assessment of LLMs' true performance. However, constructing new benchmarks is costly, labor-intensive and still carries the risk of leakage. Therefore, in this paper, we ask the question, Can we reuse these leaked benchmarks for LLM evaluation? We propose Inference-Time Decontamination (ITD) to address this issue by detecting and rewriting leaked samples without altering their difficulties. ITD can mitigate performance inflation caused by memorizing leaked benchmarks. Our proof-of-concept experiments demonstrate that ITD reduces inflated accuracy by 22.9% on GSM8K and 19.0% on MMLU. On MMLU, using Inference-time Decontamination can lead to a decrease in the results of Phi3 and Mistral by 6.7% and 3.6% respectively. We hope that ITD can provide more truthful evaluation results for large language models.
- [510] arXiv:2406.13991 [pdf, html, other]
-
Title: Bayesian Inverse Reinforcement Learning for Non-Markovian RewardsSubjects: Machine Learning (cs.LG)
Inverse reinforcement learning (IRL) is the problem of inferring a reward function from expert behavior. There are several approaches to IRL, but most are designed to learn a Markovian reward. However, a reward function might be non-Markovian, depending on more than just the current state, such as a reward machine (RM). Although there has been recent work on inferring RMs, it assumes access to the reward signal, absent in IRL. We propose a Bayesian IRL (BIRL) framework for inferring RMs directly from expert behavior, requiring significant changes to the standard framework. We define a new reward space, adapt the expert demonstration to include history, show how to compute the reward posterior, and propose a novel modification to simulated annealing to maximize this posterior. We demonstrate that our method performs well when optimizing according to its inferred reward and compares favorably to an existing method that learns exclusively binary non-Markovian rewards.
- [511] arXiv:2406.13992 [pdf, html, other]
-
Title: Robust Cooperative Multi-Agent Reinforcement Learning:A Mean-Field Type Game PerspectiveComments: Accepted for publication in L4DC 2024Subjects: Multiagent Systems (cs.MA); Systems and Control (eess.SY)
In this paper, we study the problem of robust cooperative multi-agent reinforcement learning (RL) where a large number of cooperative agents with distributed information aim to learn policies in the presence of \emph{stochastic} and \emph{non-stochastic} uncertainties whose distributions are respectively known and unknown. Focusing on policy optimization that accounts for both types of uncertainties, we formulate the problem in a worst-case (minimax) framework, which is is intractable in general. Thus, we focus on the Linear Quadratic setting to derive benchmark solutions. First, since no standard theory exists for this problem due to the distributed information structure, we utilize the Mean-Field Type Game (MFTG) paradigm to establish guarantees on the solution quality in the sense of achieved Nash equilibrium of the MFTG. This in turn allows us to compare the performance against the corresponding original robust multi-agent control problem. Then, we propose a Receding-horizon Gradient Descent Ascent RL algorithm to find the MFTG Nash equilibrium and we prove a non-asymptotic rate of convergence. Finally, we provide numerical experiments to demonstrate the efficacy of our approach relative to a baseline algorithm.
- [512] arXiv:2406.13993 [pdf, html, other]
-
Title: Exploring Changes in Nation Perception with Nationality-Assigned Personas in LLMsSubjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Persona assignment has become a common strategy for customizing LLM use to particular tasks and contexts. In this study, we explore how perceptions of different nations change when LLMs are assigned specific nationality personas. We assign 193 different nationality personas (e.g., an American person) to four LLMs and examine how the LLM perceptions of countries change. We find that all LLM-persona combinations tend to favor Western European nations, though nation-personas push LLM behaviors to focus more on and view more favorably the nation-persona's own region. Eastern European, Latin American, and African nations are viewed more negatively by different nationality personas. Our study provides insight into how biases and stereotypes are realized within LLMs when adopting different national personas. In line with the "Blueprint for an AI Bill of Rights", our findings underscore the critical need for developing mechanisms to ensure LLMs uphold fairness and not over-generalize at a global scale.
- [513] arXiv:2406.13996 [pdf, html, other]
-
Title: Unifying Graph Convolution and Contrastive Learning in Collaborative FilteringSubjects: Information Retrieval (cs.IR)
Graph-based models and contrastive learning have emerged as prominent methods in Collaborative Filtering (CF). While many existing models in CF incorporate these methods in their design, there seems to be a limited depth of analysis regarding the foundational principles behind them. This paper bridges graph convolution, a pivotal element of graph-based models, with contrastive learning through a theoretical framework. By examining the learning dynamics and equilibrium of the contrastive loss, we offer a fresh lens to understand contrastive learning via graph theory, emphasizing its capability to capture high-order connectivity. Building on this analysis, we further show that the graph convolutional layers often used in graph-based models are not essential for high-order connectivity modeling and might contribute to the risk of oversmoothing. Stemming from our findings, we introduce Simple Contrastive Collaborative Filtering (SCCF), a simple and effective algorithm based on a naive embedding model and a modified contrastive loss. The efficacy of the algorithm is demonstrated through extensive experiments across four public datasets. The experiment code is available at \url{this https URL}. \end{abstract}
- [514] arXiv:2406.13997 [pdf, html, other]
-
Title: "Global is Good, Local is Bad?": Understanding Brand Bias in LLMsSubjects: Computation and Language (cs.CL); Computational Engineering, Finance, and Science (cs.CE)
Many recent studies have investigated social biases in LLMs but brand bias has received little attention. This research examines the biases exhibited by LLMs towards different brands, a significant concern given the widespread use of LLMs in affected use cases such as product recommendation and market analysis. Biased models may perpetuate societal inequalities, unfairly favoring established global brands while marginalizing local ones. Using a curated dataset across four brand categories, we probe the behavior of LLMs in this space. We find a consistent pattern of bias in this space -- both in terms of disproportionately associating global brands with positive attributes and disproportionately recommending luxury gifts for individuals in high-income countries. We also find LLMs are subject to country-of-origin effects which may boost local brand preference in LLM outputs in specific contexts.
- [515] arXiv:2406.14004 [pdf, html, other]
-
Title: Do Not Wait: Learning Re-Ranking Model Without User Feedback At Serving Time in E-CommerceSubjects: Information Retrieval (cs.IR); Machine Learning (cs.LG)
Recommender systems have been widely used in e-commerce, and re-ranking models are playing an increasingly significant role in the domain, which leverages the inter-item influence and determines the final recommendation lists. Online learning methods keep updating a deployed model with the latest available samples to capture the shifting of the underlying data distribution in e-commerce. However, they depend on the availability of real user feedback, which may be delayed by hours or even days, such as item purchases, leading to a lag in model enhancement. In this paper, we propose a novel extension of online learning methods for re-ranking modeling, which we term LAST, an acronym for Learning At Serving Time. It circumvents the requirement of user feedback by using a surrogate model to provide the instructional signal needed to steer model improvement. Upon receiving an online request, LAST finds and applies a model modification on the fly before generating a recommendation result for the request. The modification is request-specific and transient. It means the modification is tailored to and only to the current request to capture the specific context of the request. After a request, the modification is discarded, which helps to prevent error propagation and stabilizes the online learning procedure since the predictions of the surrogate model may be inaccurate. Most importantly, as a complement to feedback-based online learning methods, LAST can be seamlessly integrated into existing online learning systems to create a more adaptive and responsive recommendation experience. Comprehensive experiments, both offline and online, affirm that LAST outperforms state-of-the-art re-ranking models.
- [516] arXiv:2406.14005 [pdf, html, other]
-
Title: Information Guided Regularization for Fine-tuning Language ModelsSubjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
The pretraining-fine-tuning paradigm has been the de facto strategy for transfer learning in modern language modeling. With the understanding that task adaptation in LMs is often a function of parameters shared across tasks, we argue that a more surgical approach to regularization needs to exist for smoother transfer learning. Towards this end, we investigate how the pretraining loss landscape is affected by these task-sensitive parameters through an information-theoretic lens. We then leverage the findings from our investigations to devise a novel approach to dropout for improved model regularization and better downstream generalization. This approach, named guided dropout, is both task & architecture agnostic and adds no computational overhead to the fine-tuning process. Through empirical evaluations, we showcase that our approach to regularization yields consistently better performance, even in scenarios of data paucity, compared to standardized baselines.
- [517] arXiv:2406.14008 [pdf, html, other]
-
Title: AMC: Access to Miss Correlation Prefetcher for Evolving Graph AnalyticsComments: 14 pages, 16 figuresSubjects: Hardware Architecture (cs.AR)
Modern memory hierarchies work well with applications that have good spatial locality. Evolving (dynamic) graphs are important applications widely used to model graphs and networks with edge and vertex changes. They exhibit irregular memory access patterns and suffer from a high miss ratio and long miss penalty. Prefetching can be employed to predict and fetch future demand misses. However, current hardware prefetchers can not efficiently predict for applications with irregular memory accesses. In evolving graph applications, vertices that do not change during graph changes exhibit the same access correlation patterns. Current temporal prefetchers use one-to-one or one-to-many correlation to exploit these patterns. Similar patterns are recorded in the same entry, which causes aliasing and can lead to poor prefetch accuracy and coverage. This work proposes a software-assisted hardware prefetcher for evolving graphs. The key idea is to record the correlations between a sequence of vertex accesses and the following misses and then prefetch when the same vertex access sequence occurs in the future. The proposed Access-to-Miss Correlation (AMC) prefetcher provides a lightweight programming interface to identify the data structures of interest and sets the iteration boundary to update the correlation table. For the evaluated applications, AMC achieves a geomean speedup of 1.5x as compared to the best-performing prefetcher in prior work (VLDP). AMC can achieve an average of 62% accuracy and coverage, whereas VLDP has an accuracy of 31% and coverage of 23%.
- [518] arXiv:2406.14012 [pdf, html, other]
-
Title: Seeing Through AI's Lens: Enhancing Human Skepticism Towards LLM-Generated Fake NewsSubjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
LLMs offer valuable capabilities, yet they can be utilized by malicious users to disseminate deceptive information and generate fake news. The growing prevalence of LLMs poses difficulties in crafting detection approaches that remain effective across various text domains. Additionally, the absence of precautionary measures for AI-generated news on online social platforms is concerning. Therefore, there is an urgent need to improve people's ability to differentiate between news articles written by humans and those produced by LLMs. By providing cues in human-written and LLM-generated news, we can help individuals increase their skepticism towards fake LLM-generated news. This paper aims to elucidate simple markers that help individuals distinguish between articles penned by humans and those created by LLMs. To achieve this, we initially collected a dataset comprising 39k news articles authored by humans or generated by four distinct LLMs with varying degrees of fake. We then devise a metric named Entropy-Shift Authorship Signature (ESAS) based on the information theory and entropy principles. The proposed ESAS ranks terms or entities, like POS tagging, within news articles based on their relevance in discerning article authorship. We demonstrate the effectiveness of our metric by showing the high accuracy attained by a basic method, i.e., TF-IDF combined with logistic regression classifier, using a small set of terms with the highest ESAS score. Consequently, we introduce and scrutinize these top ESAS-ranked terms to aid individuals in strengthening their skepticism towards LLM-generated fake news.
- [519] arXiv:2406.14013 [pdf, html, other]
-
Title: A note on cyclic non-MDS matricesComments: 18 pagesSubjects: Cryptography and Security (cs.CR)
In $1998,$ Daemen {\it{ et al.}} introduced a circulant Maximum Distance Separable (MDS) matrix in the diffusion layer of the Rijndael block cipher, drawing significant attention to circulant MDS matrices. This block cipher is now universally acclaimed as the AES block cipher. In $2016,$ Liu and Sim introduced cyclic matrices by modifying the permutation of circulant matrices and established the existence of MDS property for orthogonal left-circulant matrices, a notable subclass within cyclic matrices. While circulant matrices have been well-studied in the literature, the properties of cyclic matrices are not. Back in $1961$, Friedman introduced $g$-circulant matrices which form a subclass of cyclic matrices. In this article, we first establish a permutation equivalence between a cyclic matrix and a circulant matrix. We explore properties of cyclic matrices similar to $g$-circulant matrices. Additionally, we determine the determinant of $g$-circulant matrices of order $2^d \times 2^d$ and prove that they cannot be simultaneously orthogonal and MDS over a finite field of characteristic $2$. Furthermore, we prove that this result holds for any cyclic matrix.
- [520] arXiv:2406.14014 [pdf, html, other]
-
Title: Feature Fusion Based on Mutual-Cross-Attention Mechanism for EEG Emotion RecognitionComments: The work has been accepted by MICCAI 2024. The uploaded one is preprint which has not undergone peer review (when applicable) or any post-submission improvements or corrections. The official DOI link will be provided once availableSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
An objective and accurate emotion diagnostic reference is vital to psychologists, especially when dealing with patients who are difficult to communicate with for pathological reasons. Nevertheless, current systems based on Electroencephalography (EEG) data utilized for sentiment discrimination have some problems, including excessive model complexity, mediocre accuracy, and limited interpretability. Consequently, we propose a novel and effective feature fusion mechanism named Mutual-Cross-Attention (MCA). Combining with a specially customized 3D Convolutional Neural Network (3D-CNN), this purely mathematical mechanism adeptly discovers the complementary relationship between time-domain and frequency-domain features in EEG data. Furthermore, the new designed Channel-PSD-DE 3D feature also contributes to the high performance. The proposed method eventually achieves 99.49% (valence) and 99.30% (arousal) accuracy on DEAP dataset.
- [521] arXiv:2406.14015 [pdf, html, other]
-
Title: CohortNet: Empowering Cohort Discovery for Interpretable Healthcare AnalyticsComments: 10 pages, 12 figuresSubjects: Machine Learning (cs.LG)
Cohort studies are of significant importance in the field of healthcare analysis. However, existing methods typically involve manual, labor-intensive, and expert-driven pattern definitions or rely on simplistic clustering techniques that lack medical relevance. Automating cohort studies with interpretable patterns has great potential to facilitate healthcare analysis but remains an unmet need in prior research efforts. In this paper, we propose a cohort auto-discovery model, CohortNet, for interpretable healthcare analysis, focusing on the effective identification, representation, and exploitation of cohorts characterized by medically meaningful patterns. CohortNet initially learns fine-grained patient representations by separately processing each feature, considering both individual feature trends and feature interactions at each time step. Subsequently, it classifies each feature into distinct states and employs a heuristic cohort exploration strategy to effectively discover substantial cohorts with concrete patterns. For each identified cohort, it learns comprehensive cohort representations with credible evidence through associated patient retrieval. Ultimately, given a new patient, CohortNet can leverage relevant cohorts with distinguished importance, which can provide a more holistic understanding of the patient's conditions. Extensive experiments on three real-world datasets demonstrate that it consistently outperforms state-of-the-art approaches and offers interpretable insights from diverse perspectives in a top-down fashion.
- [522] arXiv:2406.14017 [pdf, html, other]
-
Title: EAGER: Two-Stream Generative Recommender with Behavior-Semantic CollaborationYe Wang, Jiahao Xun, Mingjie Hong, Jieming Zhu, Tao Jin, Wang Lin, Haoyuan Li, Linjun Li, Yan Xia, Zhou Zhao, Zhenhua DongComments: Accepted by KDD 2024. Source code available at this https URLSubjects: Information Retrieval (cs.IR)
Generative retrieval has recently emerged as a promising approach to sequential recommendation, framing candidate item retrieval as an autoregressive sequence generation problem. However, existing generative methods typically focus solely on either behavioral or semantic aspects of item information, neglecting their complementary nature and thus resulting in limited effectiveness. To address this limitation, we introduce EAGER, a novel generative recommendation framework that seamlessly integrates both behavioral and semantic information. Specifically, we identify three key challenges in combining these two types of information: a unified generative architecture capable of handling two feature types, ensuring sufficient and independent learning for each type, and fostering subtle interactions that enhance collaborative information utilization. To achieve these goals, we propose (1) a two-stream generation architecture leveraging a shared encoder and two separate decoders to decode behavior tokens and semantic tokens with a confidence-based ranking strategy; (2) a global contrastive task with summary tokens to achieve discriminative decoding for each type of information; and (3) a semantic-guided transfer task designed to implicitly promote cross-interactions through reconstruction and estimation objectives. We validate the effectiveness of EAGER on four public benchmarks, demonstrating its superior performance compared to existing methods.
- [523] arXiv:2406.14020 [pdf, html, other]
-
Title: Leveraging eBPF and AI for Ransomware Nose OutComments: 7 pagesSubjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Networking and Internet Architecture (cs.NI)
In this work, we propose a two-phased approach for real-time detection and deterrence of ransomware. To achieve this, we leverage the capabilities of eBPF (Extended Berkeley Packet Filter) and artificial intelligence to develop both proactive and reactive methods. In the first phase, we utilize signature based detection, where we employ custom eBPF programs to trace the execution of new processes and perform hash-based analysis against a known ransomware dataset. In the second, we employ a behavior-based technique that focuses on monitoring the process activities using a custom eBPF program and the creation of ransom notes, a prominent indicator of ransomware activity through the use of Natural Language Processing (NLP). By leveraging low-level tracing capabilities of eBPF and integrating NLP based machine learning algorithms, our solution achieves an impressive 99.76% accuracy in identifying ransomware incidents within a few seconds on the onset of zero-day attacks.
- [524] arXiv:2406.14021 [pdf, html, other]
-
Title: HIGHT: Hierarchical Graph Tokenization for Graph-Language AlignmentComments: Preliminary version of an ongoing project: this https URLSubjects: Computation and Language (cs.CL); Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
Recently there has been a surge of interest in extending the success of large language models (LLMs) to graph modality, such as social networks and molecules. As LLMs are predominantly trained with 1D text data, most existing approaches adopt a graph neural network to represent a graph as a series of node tokens and feed these tokens to LLMs for graph-language alignment. Despite achieving some successes, existing approaches have overlooked the hierarchical structures that are inherent in graph data. Especially, in molecular graphs, the high-order structural information contains rich semantics of molecular functional groups, which encode crucial biochemical functionalities of the molecules. We establish a simple benchmark showing that neglecting the hierarchical information in graph tokenization will lead to subpar graph-language alignment and severe hallucination in generated outputs. To address this problem, we propose a novel strategy called HIerarchical GrapH Tokenization (HIGHT). HIGHT employs a hierarchical graph tokenizer that extracts and encodes the hierarchy of node, motif, and graph levels of informative tokens to improve the graph perception of LLMs. HIGHT also adopts an augmented graph-language supervised fine-tuning dataset, enriched with the hierarchical graph information, to further enhance the graph-language alignment. Extensive experiments on 7 molecule-centric benchmarks confirm the effectiveness of HIGHT in reducing hallucination by 40%, as well as significant improvements in various molecule-language downstream tasks.
- [525] arXiv:2406.14022 [pdf, html, other]
-
Title: Investigating the Pre-Training Dynamics of In-Context Learning: Task Recognition vs. Task LearningComments: work in progressSubjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
The emergence of in-context learning (ICL) is potentially attributed to two major abilities: task recognition (TR) for recognizing the task from demonstrations and utilizing pre-trained priors, and task learning (TL) for learning from demonstrations. However, relationships between the two abilities and how such relationships affect the emergence of ICL is unclear. In this paper, we take the first step by examining the pre-training dynamics of the emergence of ICL. With carefully designed metrics, we find that these two abilities are, in fact, competitive during pre-training. Moreover, we observe a strong negative correlation between the competition and ICL performance. Further analysis of common pre-training factors (i.e., model size, dataset size, and data curriculum) demonstrates possible ways to manage the competition. Based on these insights, we propose a simple yet effective method to better integrate these two abilities for ICL at inference time. Through adaptive ensemble learning, the performance of ICL can be significantly boosted, enabling two small models to outperform a larger one with more than twice the parameters. The code is available at this https URL.
- [526] arXiv:2406.14023 [pdf, html, other]
-
Title: Evaluating Implicit Bias in Large Language Models by Attacking From a Psychometric PerspectiveComments: Code and datasets are available at this https URLSubjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
As Large Language Models (LLMs) become an important way of information seeking, there have been increasing concerns about the unethical content LLMs may generate. In this paper, we conduct a rigorous evaluation of LLMs' implicit bias towards certain groups by attacking them with carefully crafted instructions to elicit biased responses. Our attack methodology is inspired by psychometric principles in cognitive and social psychology. We propose three attack approaches, i.e., Disguise, Deception, and Teaching, based on which we built evaluation datasets for four common bias types. Each prompt attack has bilingual versions. Extensive evaluation of representative LLMs shows that 1) all three attack methods work effectively, especially the Deception attacks; 2) GLM-3 performs the best in defending our attacks, compared to GPT-3.5 and GPT-4; 3) LLMs could output content of other bias types when being taught with one type of bias. Our methodology provides a rigorous and effective way of evaluating LLMs' implicit bias and will benefit the assessments of LLMs' potential ethical risks.
- [527] arXiv:2406.14024 [pdf, html, other]
-
Title: The Reason behind Good or Bad: Towards a Better Mathematical Verifier with Natural Language FeedbackBofei Gao, Zefan Cai, Runxin Xu, Peiyi Wang, Ce Zheng, Runji Lin, Keming Lu, Junyang Lin, Chang Zhou, Tianyu Liu, Baobao ChangComments: 9 pagesSubjects: Computation and Language (cs.CL)
Mathematical verfier achieves success in mathematical reasoning tasks by validating the correctness of solutions. However, existing verifiers are trained with binary classification labels, which are not informative enough for the model to accurately assess the solutions. To mitigate the aforementioned insufficiency of binary labels, we introduce step-wise natural language feedbacks as rationale labels (i.e., the correctness of the current step and the explanations). In this paper, we propose \textbf{Math-Minos}, a natural language feedback enhanced verifier by constructing automatically-generated training data and a two-stage training paradigm for effective training and efficient inference. Our experiments reveal that a small set (30k) of natural language feedbacks can significantly boost the performance of the verifier by the accuracy of 1.6\% (86.6\% $\rightarrow$ 88.2\%) on GSM8K and 0.8\% (37.8\% $\rightarrow$ 38.6\%) on MATH. We will release the code, data and model for reproduction soon.
- [528] arXiv:2406.14026 [pdf, html, other]
-
Title: Demystifying Forgetting in Language Model Fine-Tuning with Statistical Analysis of Example AssociationsComments: 5 pagesSubjects: Machine Learning (cs.LG); Computation and Language (cs.CL); Machine Learning (stat.ML)
Language models (LMs) are known to suffer from forgetting of previously learned examples when fine-tuned, breaking stability of deployed LM systems. Despite efforts on mitigating forgetting, few have investigated whether, and how forgotten upstream examples are associated with newly learned tasks. Insights on such associations enable efficient and targeted mitigation of forgetting. In this paper, we empirically analyze forgetting that occurs in $N$ upstream examples while the model learns $M$ new tasks and visualize their associations with a $M \times N$ matrix. We empirically demonstrate that the degree of forgetting can often be approximated by simple multiplicative contributions of the upstream examples and newly learned tasks. We also reveal more complicated patterns where specific subsets of examples are forgotten with statistics and visualization. Following our analysis, we predict forgetting that happens on upstream examples when learning a new task with matrix completion over the empirical associations, outperforming prior approaches that rely on trainable LMs. Project website: this https URL
- [529] arXiv:2406.14027 [pdf, other]
-
Title: How to design a dataset compliant with an ML-based system ODD?Cyril Cappi, Noémie Cohen, Mélanie Ducoffe, Christophe Gabreau, Laurent Gardes, Adrien Gauffriau, Jean-Brice Ginestet, Franck Mamalet, Vincent Mussot, Claire Pagetti, David VigourouxComments: 12th European Congress on Embedded Real Time Software and Systems, Jun 2024, Toulouse, France. arXiv admin note: text overlap with arXiv:2304.09938Subjects: Artificial Intelligence (cs.AI)
This paper focuses on a Vision-based Landing task and presents the design and the validation of a dataset that would comply with the Operational Design Domain (ODD) of a Machine-Learning (ML) system. Relying on emerging certification standards, we describe the process for establishing ODDs at both the system and image levels. In the process, we present the translation of high-level system constraints into actionable image-level properties, allowing for the definition of verifiable Data Quality Requirements (DQRs). To illustrate this approach, we use the Landing Approach Runway Detection (LARD) dataset which combines synthetic imagery and real footage, and we focus on the steps required to verify the DQRs. The replicable framework presented in this paper addresses the challenges of designing a dataset compliant with the stringent needs of ML-based systems certification in safety-critical applications.
- [530] arXiv:2406.14028 [pdf, html, other]
-
Title: Reliable State Estimation in a Truck-Semitrailer Combination using an Artificial Neural Network-Aided Extended Kalman FilterComments: 8 pages, 3 figures, accepted for publication at 2024 European Control Conference (ECC)Subjects: Systems and Control (eess.SY)
Advanced driver assistance systems are critically dependent on reliable and accurate information regarding a vehicles' driving state. For estimation of unknown quantities, model-based and learning-based methods exist, but both suffer from individual limitations. On the one hand, model-based estimation performance is often limited by the models' accuracy. On the other hand, learning-based estimators usually do not perform well in "unknown" conditions (bad generalization), which is particularly critical for semitrailers as their payload changes significantly in operation. To the best of the authors' knowledge, this work is the first to analyze the capability of state-of-the-art estimators for semitrailers to generalize across "unknown" loading states. Moreover, a novel hybrid Extended Kalman Filter (H-EKF) that takes advantage of accurate Artificial Neural Network (ANN) estimates while preserving reliable generalization capability is presented. It estimates the articulation angle between truck and semitrailer, lateral tire forces and the truck steering angle utilizing sensor data of a standard semitrailer only. An experimental comparison based on a full-scale truck-semitrailer combination indicates the superiority of the H-EKF compared to a state-of-the-art extended Kalman filter and an ANN estimator.
- [531] arXiv:2406.14035 [pdf, other]
-
Title: Two Giraffes in a Dirt Field: Using Game Play to Investigate Situation Modelling in Large Multimodal ModelsSherzod Hakimov, Yerkezhan Abdullayeva, Kushal Koshti, Antonia Schmidt, Yan Weiser, Anne Beyer, David SchlangenComments: under reviewSubjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
While the situation has improved for text-only models, it again seems to be the case currently that multimodal (text and image) models develop faster than ways to evaluate them. In this paper, we bring a recently developed evaluation paradigm from text models to multimodal models, namely evaluation through the goal-oriented game (self) play, complementing reference-based and preference-based evaluation. Specifically, we define games that challenge a model's capability to represent a situation from visual information and align such representations through dialogue. We find that the largest closed models perform rather well on the games that we define, while even the best open-weight models struggle with them. On further analysis, we find that the exceptional deep captioning capabilities of the largest models drive some of the performance. There is still room to grow for both kinds of models, ensuring the continued relevance of the benchmark.
- [532] arXiv:2406.14036 [pdf, html, other]
-
Title: Toward Infinite-Long Prefix in TransformerSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Prompting and contextual-based fine-tuning methods, which we call Prefix Learning, have been proposed to enhance the performance of language models on various downstream tasks that can match full parameter fine-tuning. There remains a limited theoretical understanding of how these methods work. In this paper, we aim to relieve this limitation by studying the learning ability of Prefix Learning from the perspective of prefix length. In particular, we approximate the infinite-long Prefix Learning optimization process by the Neural Tangent Kernel (NTK) technique. We formulate and solve it as a learning problem of the infinite-long prefix in a one-layer attention network. Our results confirm the over-parameterization property and arbitrary small loss convergence guarantee of the infinite-long Prefix Learning in attention. To the implementation end, we propose our NTK-Attention method, which is "equivalent" to attention computation with arbitrary prefix length efficiently. Its time complexity mainly depends on the sub-quadratic of input length (without prefix), and our method only requires $d^2 + d$ extra parameters for representation, where $d$ is the feature dimension. In addition, we conducted experiments that compare our NTK-Attention with full parameters fine-tuning, LoRA, and P-Tuning V2 methods across vision or natural language datasets. The results indicate our approach may be a promising parameter-efficient-fine-tuning method since it has demonstrated superior performance in numerous scenarios. Our code can be found at \url{this https URL}.
- [533] arXiv:2406.14038 [pdf, html, other]
-
Title: Resource-efficient Medical Image Analysis with Self-adapting Forward-Forward NetworksComments: Under ReviewSubjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
We introduce a fast Self-adapting Forward-Forward Network (SaFF-Net) for medical imaging analysis, mitigating power consumption and resource limitations, which currently primarily stem from the prevalent reliance on back-propagation for model training and fine-tuning. Building upon the recently proposed Forward-Forward Algorithm (FFA), we introduce the Convolutional Forward-Forward Algorithm (CFFA), a parameter-efficient reformulation that is suitable for advanced image analysis and overcomes the speed and generalisation constraints of the original FFA. To address hyper-parameter sensitivity of FFAs we are also introducing a self-adapting framework SaFF-Net fine-tuning parameters during warmup and training in parallel. Our approach enables more effective model training and eliminates the previously essential requirement for an arbitrarily chosen Goodness function in FFA. We evaluate our approach on several benchmarking datasets in comparison with standard Back-Propagation (BP) neural networks showing that FFA-based networks with notably fewer parameters and function evaluations can compete with standard models, especially, in one-shot scenarios and large batch sizes. The code will be available at the time of the conference.
- [534] arXiv:2406.14039 [pdf, other]
-
Title: CryptoGPT: a 7B model rivaling GPT-4 in the task of analyzing and classifying real-time financial newsComments: Journ{é}e Nationale sur la Fouille de Textes, Pascal CUXAC; Adrien GUILLE; C{é}dric LOPEZ, Jun 2024, Lyon (Universit{é} Lumi{è}re Lyon 2), FranceSubjects: Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Computation and Language (cs.CL); Neural and Evolutionary Computing (cs.NE)
CryptoGPT: a 7B model competing with GPT-4 in a specific task -- The Impact of Automatic Annotation and Strategic Fine-Tuning via QLoRAIn this article, we present a method aimed at refining a dedicated LLM of reasonable quality with limited resources in an industrial setting via CryptoGPT. It is an LLM designed for financial news analysis for the cryptocurrency market in real-time. This project was launched in an industrial context. This model allows not only for the classification of financial information but also for providing comprehensive analysis. We refined different LLMs of the same size such as Mistral-7B and LLama-7B using semi-automatic annotation and compared them with various LLMs such as GPT-3.5 and GPT-4. Our goal is to find a balance among several needs: 1. Protecting data (by avoiding their transfer to external servers), 2. Limiting annotation cost and time, 3. Controlling the model's size (to manage deployment costs), and 4. Maintaining better analysis quality.
- [535] arXiv:2406.14041 [pdf, other]
-
Title: Discontinuous Galerkin method for a three-dimensional coupled fluid-poroelastic model with applications to brain fluid mechanicsComments: 22 pages, 8 figuresSubjects: Numerical Analysis (math.NA)
The modeling of the interaction between a poroelastic medium and a fluid in a hollow cavity is crucial for understanding, e.g., the multiphysics flow of blood and Cerebrospinal Fluid (CSF) in the brain, the supply of blood by the coronary arteries in heart perfusion, or the interaction between groundwater and rivers or lakes. In particular, the cerebral tissue's elasticity and its perfusion by blood and interstitial CSF can be described by Multi-compartment Poroelasticity (MPE), while CSF flow in the brain ventricles can be modeled by the (Navier-)Stokes equations, the overall system resulting in a coupled MPE-(Navier-)Stokes system. The aim of this paper is three-fold. First, we aim to extend and verify in a three-dimensional setting a discontinuous Galerkin method on polytopal grids recently presented for the MPE-Stokes problem. Second, we carry out the analysis of the method based on an extension of the proposed formulation so that physics-based Beavers-Joseph-Saffman conditions are taken into account at the interface: these conditions are essential to model the friction between the fluid and the porous medium. Finally, by a comparative numerical investigation, we assess the fluid-dynamics effects of these boundary conditions and of employing either Stokes or Navier-Stokes equations to model the CSF flow. The semidiscrete numerical scheme for the coupled problem is proved to be stable and optimally convergent. Temporal discretization is obtained using Newmark's $\beta$-method for the elastic wave equation and the $\theta$-method for the remaining equations of the model. The theoretical error estimates are verified by numerical simulations on a test case with a manufactured solution, and a numerical investigation is carried out on a three-dimensional geometry to assess the effects of interface conditions and fluid inertia on the system.
- [536] arXiv:2406.14043 [pdf, html, other]
-
Title: Taxonomy-Guided Zero-Shot Recommendations with LLMsSubjects: Information Retrieval (cs.IR); Computation and Language (cs.CL)
With the emergence of large language models (LLMs) and their ability to perform a variety of tasks, their application in recommender systems (RecSys) has shown promise. However, we are facing significant challenges when deploying LLMs into RecSys, such as limited prompt length, unstructured item information, and un-constrained generation of recommendations, leading to sub-optimal performance. To address these issues, we propose a novel method using a taxonomy dictionary. This method provides a systematic framework for categorizing and organizing items, improving the clarity and structure of item information. By incorporating the taxonomy dictionary into LLM prompts, we achieve efficient token utilization and controlled feature generation, leading to more accurate and contextually relevant recommendations. Our Taxonomy-guided Recommendation (TaxRec) approach features a two-step process: one-time taxonomy categorization and LLM-based recommendation, enabling zero-shot recommendations without the need for domain-specific fine-tuning. Experimental results demonstrate TaxRec significantly enhances recommendation quality compared to traditional zero-shot approaches, showcasing its efficacy as personal recommender with LLMs. Code is available at this https URL.
- [537] arXiv:2406.14045 [pdf, html, other]
-
Title: Understanding Different Design Choices in Training Large Time Series ModelsYu-Neng Chuang, Songchen Li, Jiayi Yuan, Guanchu Wang, Kwei-Herng Lai, Leisheng Yu, Sirui Ding, Chia-Yuan Chang, Qiaoyu Tan, Daochen Zha, Xia HuSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Inspired by Large Language Models (LLMs), Time Series Forecasting (TSF), a long-standing task in time series analysis, is undergoing a transition towards Large Time Series Models (LTSMs), aiming to train universal transformer-based models for TSF. However, training LTSMs on heterogeneous time series data poses unique challenges, including diverse frequencies, dimensions, and patterns across datasets. Recent endeavors have studied and evaluated various design choices aimed at enhancing LTSM training and generalization capabilities, spanning pre-processing techniques, model configurations, and dataset configurations. In this work, we comprehensively analyze these design choices and aim to identify the best practices for training LTSM. Moreover, we propose \emph{time series prompt}, a novel statistical prompting strategy tailored to time series data. Furthermore, based on the observations in our analysis, we introduce \texttt{LTSM-bundle}, which bundles the best design choices we have identified. Empirical results demonstrate that \texttt{LTSM-bundle} achieves superior zero-shot and few-shot performances compared to state-of-the-art LSTMs and traditional TSF methods on benchmark datasets.
- [538] arXiv:2406.14047 [pdf, html, other]
-
Title: Constrained Meta Agnostic Reinforcement LearningSubjects: Machine Learning (cs.LG)
Meta-Reinforcement Learning (Meta-RL) aims to acquire meta-knowledge for quick adaptation to diverse tasks. However, applying these policies in real-world environments presents a significant challenge in balancing rapid adaptability with adherence to environmental constraints. Our novel approach, Constraint Model Agnostic Meta Learning (C-MAML), merges meta learning with constrained optimization to address this challenge. C-MAML enables rapid and efficient task adaptation by incorporating task-specific constraints directly into its meta-algorithm framework during the training phase. This fusion results in safer initial parameters for learning new tasks. We demonstrate the effectiveness of C-MAML in simulated locomotion with wheeled robot tasks of varying complexity, highlighting its practicality and robustness in dynamic environments.
- [539] arXiv:2406.14048 [pdf, html, other]
-
Title: Prompt Injection Attacks in Defended SystemsSubjects: Computation and Language (cs.CL)
Large language models play a crucial role in modern natural language processing technologies. However, their extensive use also introduces potential security risks, such as the possibility of black-box attacks. These attacks can embed hidden malicious features into the model, leading to adverse consequences during its deployment.
This paper investigates methods for black-box attacks on large language models with a three-tiered defense mechanism. It analyzes the challenges and significance of these attacks, highlighting their potential implications for language processing system security. Existing attack and defense methods are examined, evaluating their effectiveness and applicability across various scenarios.
Special attention is given to the detection algorithm for black-box attacks, identifying hazardous vulnerabilities in language models and retrieving sensitive information. This research presents a methodology for vulnerability detection and the development of defensive strategies against black-box attacks on large language models. - [540] arXiv:2406.14050 [pdf, html, other]
-
Title: Gaze-directed Vision GNN for Mitigating Shortcut Learning in Medical ImageSubjects: Computer Vision and Pattern Recognition (cs.CV)
Deep neural networks have demonstrated remarkable performance in medical image analysis. However, its susceptibility to spurious correlations due to shortcut learning raises concerns about network interpretability and reliability. Furthermore, shortcut learning is exacerbated in medical contexts where disease indicators are often subtle and sparse. In this paper, we propose a novel gaze-directed Vision GNN (called GD-ViG) to leverage the visual patterns of radiologists from gaze as expert knowledge, directing the network toward disease-relevant regions, and thereby mitigating shortcut learning. GD-ViG consists of a gaze map generator (GMG) and a gaze-directed classifier (GDC). Combining the global modelling ability of GNNs with the locality of CNNs, GMG generates the gaze map based on radiologists' visual patterns. Notably, it eliminates the need for real gaze data during inference, enhancing the network's practical applicability. Utilizing gaze as the expert knowledge, the GDC directs the construction of graph structures by incorporating both feature distances and gaze distances, enabling the network to focus on disease-relevant foregrounds. Thereby avoiding shortcut learning and improving the network's interpretability. The experiments on two public medical image datasets demonstrate that GD-ViG outperforms the state-of-the-art methods, and effectively mitigates shortcut learning. Our code is available at this https URL.
- [541] arXiv:2406.14051 [pdf, html, other]
-
Title: How Many Parameters Does it Take to Change a Light Bulb? Evaluating Performance in Self-Play of Conversational Games as a Function of Model CharacteristicsComments: under reviewSubjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
What makes a good Large Language Model (LLM)? That it performs well on the relevant benchmarks -- which hopefully measure, with some validity, the presence of capabilities that are also challenged in real application. But what makes the model perform well? What gives a model its abilities? We take a recently introduced type of benchmark that is meant to challenge capabilities in a goal-directed, agentive context through self-play of conversational games, and analyse how performance develops as a function of model characteristics like number of parameters, or type of training. We find that while there is a clear relationship between number of parameters and performance, there is still a wide spread of performance points within a given size bracket, which is to be accounted for by training parameters such as fine-tuning data quality and method. From a more practical angle, we also find a certain degree of unpredictability about performance across access methods, possible due to unexposed sampling parameters, and a, very welcome, performance stability against at least moderate weight quantisation during inference.
- [542] arXiv:2406.14054 [pdf, html, other]
-
Title: Urban-Focused Multi-Task Offline Reinforcement Learning with Contrastive Data SharingComments: KDD 2024Subjects: Machine Learning (cs.LG)
Enhancing diverse human decision-making processes in an urban environment is a critical issue across various applications, including ride-sharing vehicle dispatching, public transportation management, and autonomous driving. Offline reinforcement learning (RL) is a promising approach to learn and optimize human urban strategies (or policies) from pre-collected human-generated spatial-temporal urban data. However, standard offline RL faces two significant challenges: (1) data scarcity and data heterogeneity, and (2) distributional shift. In this paper, we introduce MODA -- a Multi-Task Offline Reinforcement Learning with Contrastive Data Sharing approach. MODA addresses the challenges of data scarcity and heterogeneity in a multi-task urban setting through Contrastive Data Sharing among tasks. This technique involves extracting latent representations of human behaviors by contrasting positive and negative data pairs. It then shares data presenting similar representations with the target task, facilitating data augmentation for each task. Moreover, MODA develops a novel model-based multi-task offline RL algorithm. This algorithm constructs a robust Markov Decision Process (MDP) by integrating a dynamics model with a Generative Adversarial Network (GAN). Once the robust MDP is established, any online RL or planning algorithm can be applied. Extensive experiments conducted in a real-world multi-task urban setting validate the effectiveness of MODA. The results demonstrate that MODA exhibits significant improvements compared to state-of-the-art baselines, showcasing its capability in advancing urban decision-making processes. We also made our code available to the research community.
- [543] arXiv:2406.14056 [pdf, html, other]
-
Title: VGA: Vision GUI Assistant -- Minimizing Hallucinations through Image-Centric Fine-TuningComments: 18 pagesSubjects: Computer Vision and Pattern Recognition (cs.CV)
Recent advances in Large Vision-Language Models (LVLMs) have significantly improve performance in image comprehension tasks, such as formatted charts and rich-content images. Yet, Graphical User Interface (GUI) pose a greater challenge due to their structured format and detailed textual information. Existing LVLMs often overly depend on internal knowledge and neglect image content, resulting in hallucinations and incorrect responses in GUI this http URL address these issues, we introduce VGA, a fine-tuned model designed for comprehensive GUI understanding. Our model aims to enhance the interpretation of visual data of GUI and reduce hallucinations. We first construct a Vision Question Answering (VQA) dataset of 63.8k high-quality examples with our propose Referent Method, which ensures the model's responses are highly depend on visual content within the image. We then design a two-stage fine-tuning method called Foundation and Advanced Comprehension (FAC) to enhance both the model's ability to extract information from image content and alignment with human intent. Experiments show that our approach enhances the model's ability to extract information from images and achieves state-of-the-art results in GUI understanding tasks. Our dataset and fine-tuning script will be released soon.
- [544] arXiv:2406.14059 [pdf, other]
-
Title: Tracking solutions of time-varying variational inequalitiesSubjects: Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
Tracking the solution of time-varying variational inequalities is an important problem with applications in game theory, optimization, and machine learning. Existing work considers time-varying games or time-varying optimization problems. For strongly convex optimization problems or strongly monotone games, these results provide tracking guarantees under the assumption that the variation of the time-varying problem is restrained, that is, problems with a sublinear solution path. In this work we extend existing results in two ways: In our first result, we provide tracking bounds for (1) variational inequalities with a sublinear solution path but not necessarily monotone functions, and (2) for periodic time-varying variational inequalities that do not necessarily have a sublinear solution path-length. Our second main contribution is an extensive study of the convergence behavior and trajectory of discrete dynamical systems of periodic time-varying VI. We show that these systems can exhibit provably chaotic behavior or can converge to the solution. Finally, we illustrate our theoretical results with experiments.
- [545] arXiv:2406.14064 [pdf, html, other]
-
Title: PAPR Reduction with Pre-chirp Selection for Affine Frequency Division MultipleSubjects: Information Theory (cs.IT); Signal Processing (eess.SP)
Affine frequency division multiplexing (AFDM) is a promising new multicarrier technique based on discrete affine Fourier transform (DAFT). By properly tuning pre-chirp parameter and post-chirp parameter in the DAFT, the effective channel in the DAFT domain can completely avoid overlap of different paths, thus constitutes a full representation of delay-Doppler profile, which significantly improves the system performance in high mobility scenarios. However, AFDM has the crucial problem of high peak-to-average power ratio (PAPR) caused by phase randomness of modulated symbols. In this letter, an algorithm named grouped pre-chirp selection (GPS) is proposed to reduce the PAPR by changing the value of pre-chirp parameter on sub-carriers group by group. Specifically, it is demonstrated first that the important properties of AFDM system are maintained when implementing GPS. Secondly, we elaborate the operation steps of GPS algorithm, illustrating its effect on PAPR reduction and its advantage in terms of computational complexity compared with the ungrouped approach. Finally, simulation results of PAPR reduction in the form of complementary cumulative distribution function (CCDF) show the effectiveness of the proposed GPS algorithm.
- [546] arXiv:2406.14065 [pdf, other]
-
Title: On one-step numerical schemes of weak convergence for SDEs with super-linear coefficientsSubjects: Numerical Analysis (math.NA); Probability (math.PR)
We consider weak convergence of one-step schemes for solving stochastic differential equations (SDEs) with one-sided Lipschitz conditions. It is known that the super-linear coefficients may lead to a blowup of moments of solutions and their numerical solutions. When solutions to SDEs have all finite moments, weak convergence of numerical schemes has been investigated in [Wang et al (2023), Weak error analysis for strong approximation schemes of SDEs with super-linear coefficients, IMA Journal numerical analysis]. Some modified Euler schemes have been analyzed for weak convergence. In this work, we present a family of explicit schemes of first and second-order weak convergence based on classical schemes for SDEs. We explore the effects of limited moments on these schemes. We provide a systematic but simple way to establish weak convergence orders for schemes based on approximations/modifications of drift and diffusion coefficients. We present several numerical examples of these schemes and show their weak convergence orders.
- [547] arXiv:2406.14066 [pdf, html, other]
-
Title: Optimizing Speculative Decoding for Serving Large Language Models Using GoodputXiaoxuan Liu, Cade Daniel, Langxiang Hu, Woosuk Kwon, Zhuohan Li, Xiangxi Mo, Alvin Cheung, Zhijie Deng, Ion Stoica, Hao ZhangSubjects: Artificial Intelligence (cs.AI); Performance (cs.PF)
Reducing the inference latency of large language models (LLMs) is crucial, and speculative decoding (SD) stands out as one of the most effective techniques. Rather than letting the LLM generate all tokens directly, speculative decoding employs effective proxies to predict potential outputs, which are then verified by the LLM without compromising the generation quality. Yet, deploying SD in real online LLM serving systems (with continuous batching) does not always yield improvement -- under higher request rates or low speculation accuracy, it paradoxically increases latency. Furthermore, there is no best speculation length work for all workloads under different system loads. Based on the observations, we develop a dynamic framework SmartSpec. SmartSpec dynamically determines the best speculation length for each request (from 0, i.e., no speculation, to many tokens) -- hence the associated speculative execution costs -- based on a new metric called goodput, which characterizes the current observed load of the entire system and the speculation accuracy. We show that SmartSpec consistently reduces average request latency by up to 3.2x compared to non-speculative decoding baselines across different sizes of target models, draft models, request rates, and datasets. Moreover, SmartSpec can be applied to different styles of speculative decoding, including traditional, model-based approaches as well as model-free methods like prompt lookup and tree-style decoding.
- [548] arXiv:2406.14068 [pdf, html, other]
-
Title: Classifying Dry Eye Disease Patients from Healthy Controls Using Machine Learning and Metabolomics DataSajad Amouei Sheshkal, Morten Gundersen, Michael Alexander Riegler, Øygunn Aass Utheim, Kjell Gunnar Gundersen, Hugo Lewi HammerSubjects: Computer Vision and Pattern Recognition (cs.CV)
Dry eye disease is a common disorder of the ocular surface, leading patients to seek eye care. Clinical signs and symptoms are currently used to diagnose dry eye disease. Metabolomics, a method for analyzing biological systems, has been found helpful in identifying distinct metabolites in patients and in detecting metabolic profiles that may indicate dry eye disease at early stages. In this study, we explored using machine learning and metabolomics information to identify which cataract patients suffered from dry eye disease. As there is no one-size-fits-all machine learning model for metabolomics data, choosing the most suitable model can significantly affect the quality of predictions and subsequent metabolomics analyses. To address this challenge, we conducted a comparative analysis of nine machine learning models on three metabolomics data sets from cataract patients with and without dry eye disease. The models were evaluated and optimized using nested k-fold cross-validation. To assess the performance of these models, we selected a set of suitable evaluation metrics tailored to the data set's challenges. The logistic regression model overall performed the best, achieving the highest area under the curve score of 0.8378, balanced accuracy of 0.735, Matthew's correlation coefficient of 0.5147, an F1-score of 0.8513, and a specificity of 0.5667. Additionally, following the logistic regression, the XGBoost and Random Forest models also demonstrated good performance.
- [549] arXiv:2406.14073 [pdf, html, other]
-
Title: Exploring Layerwise Adversarial Robustness Through the Lens of t-SNESubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
Adversarial examples, designed to trick Artificial Neural Networks (ANNs) into producing wrong outputs, highlight vulnerabilities in these models. Exploring these weaknesses is crucial for developing defenses, and so, we propose a method to assess the adversarial robustness of image-classifying ANNs. The t-distributed Stochastic Neighbor Embedding (t-SNE) technique is used for visual inspection, and a metric, which compares the clean and perturbed embeddings, helps pinpoint weak spots in the layers. Analyzing two ANNs on CIFAR-10, one designed by humans and another via NeuroEvolution, we found that differences between clean and perturbed representations emerge early on, in the feature extraction layers, affecting subsequent classification. The findings with our metric are supported by the visual analysis of the t-SNE maps.
- [550] arXiv:2406.14075 [pdf, html, other]
-
Title: EXCEEDS: Extracting Complex Events as Connecting the Dots to Graphs in Scientific DomainComments: This paper is working in processSubjects: Computation and Language (cs.CL)
It is crucial to utilize events to understand a specific domain. There are lots of research on event extraction in many domains such as news, finance and biology domain. However, scientific domain still lacks event extraction research, including comprehensive datasets and corresponding methods. Compared to other domains, scientific domain presents two characteristics: denser nuggets and more complex events. To solve the above problem, considering these two characteristics, we first construct SciEvents, a large-scale multi-event document-level dataset with a schema tailored for scientific domain. It has 2,508 documents and 24,381 events under refined annotation and quality control. Then, we propose EXCEEDS, a novel end-to-end scientific event extraction framework by storing dense nuggets in a grid matrix and simplifying complex event extraction into a dot construction and connection task. Experimental results demonstrate state-of-the-art performances of EXCEEDS on SciEvents. Additionally, we release SciEvents and EXCEEDS on GitHub.
- [551] arXiv:2406.14077 [pdf, other]
-
Title: GTP-UDrive: Unified Game-Theoretic Trajectory Planner and Decision-Maker for Autonomous Driving in Mixed Traffic EnvironmentsJournal-ref: 35th IEEE Intelligent Vehicles Symposium (IV 2024), Jun 2024, Jeju Island, South KoreaSubjects: Robotics (cs.RO)
Understanding the interdependence between autonomous and human-operated vehicles remains an ongoing challenge, with significant implications for the safety and feasibility of autonomous driving.This interdependence arises from inherent interactions among road users.Thus, it is crucial for Autonomous Vehicles (AVs) to understand and analyze the intentions of human-driven vehicles, and to display behavior comprehensible to other traffic this http URL this end, this paper presents GTP-UDRIVE, a unified game-theoretic trajectory planner and decision-maker considering a mixed-traffic environment. Our model considers the intentions of other vehicles in the decision-making process and provides the AV with a human-like trajectory, based on the clothoid interpolation technique.% This study investigates a solver based on Particle Swarm Optimization (PSO) that quickly converges to an optimal decision.Among highly interactive traffic scenarios, the intersection crossing is particularly challenging. Hence, we choose to demonstrate the feasibility and effectiveness of our method in real traffic conditions, using an experimental autonomous vehicle at an unsignalized intersection. Testing results reveal that our approach is suitable for 1) Making decisions and generating trajectories simultaneously. 2) Describing the vehicle's trajectory as a piecewise clothoid and enforcing geometric constraints. 3) Reducing search space dimensionality for the trajectory optimization problem.
- [552] arXiv:2406.14080 [pdf, html, other]
-
Title: CMTNet: Convolutional Meets Transformer Network for Hyperspectral Images ClassificationComments: 15 pages, 11figuresSubjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Hyperspectral remote sensing (HIS) enables the detailed capture of spectral information from the Earth's surface, facilitating precise classification and identification of surface crops due to its superior spectral diagnostic capabilities. However, current convolutional neural networks (CNNs) focus on local features in hyperspectral data, leading to suboptimal performance when classifying intricate crop types and addressing imbalanced sample distributions. In contrast, the Transformer framework excels at extracting global features from hyperspectral imagery. To leverage the strengths of both approaches, this research introduces the Convolutional Meet Transformer Network (CMTNet). This innovative model includes a spectral-spatial feature extraction module for shallow feature capture, a dual-branch structure combining CNN and Transformer branches for local and global feature extraction, and a multi-output constraint module that enhances classification accuracy through multi-output loss calculations and cross constraints across local, international, and joint features. Extensive experiments conducted on three datasets (WHU-Hi-LongKou, WHU-Hi-HanChuan, and WHU-Hi-HongHu) demonstrate that CTDBNet significantly outperforms other state-of-the-art networks in classification performance, validating its effectiveness in hyperspectral crop classification.
- [553] arXiv:2406.14081 [pdf, other]
-
Title: COOK Access Control on an embedded Volta GPUSubjects: Hardware Architecture (cs.AR)
The last decade has seen the emergence of a new generation of multi-core in response to advances in machine learning, and in particular Deep Neural Network (DNN) training and inference tasks. These platforms, like the JETSON AGX XAVIER, embed several cores and accelerators in a SWaP- efficient (Size Weight and Power) package with a limited set of resources. However, concurrent applications tend to interfere on shared resources, resulting in high execution time variability for applications compared to their behaviour in isolation.Access control techniques aim to selectively restrict the flow of operations executed by a resource. To reduce the impact of interference on the JETSON Volta GPU, we specify and implement an access control technique to ensure each GPU operation executes in isolation to reduce its timing variability. We implement the controller using three different strategies and assess their complexity and impact on the application performance. Our evaluation shows the benefits of adding the access control: its transparency to applications, reduced timing variability, isolation between GPU operations, and small code complexity. However, the strategies may cause some potential slowdowns for applications even in isolation but which are reasonable.
- [554] arXiv:2406.14082 [pdf, other]
-
Title: FLoCoRA: Federated learning compression with low-rank adaptationLucas Grativol Ribeiro (IMT Atlantique - MEE, Lab\_STICC\_BRAIn, Lab-STICC\_2AI, LHC), Mathieu Leonardon (IMT Atlantique - MEE, Lab\_STICC\_BRAIn), Guillaume Muller (Mines Saint-Étienne MSE, FAYOL-ENSMSE, FAYOL-ENSMSE), Virginie Fresse (LHC, TSE), Matthieu Arzel (IMT Atlantique - MEE, Lab-STICC\_2AI)Journal-ref: 32nd European Signal Processing Conference EUSIPCO, Aug 2024, Lyon, FranceSubjects: Machine Learning (cs.LG); Signal Processing (eess.SP)
Low-Rank Adaptation (LoRA) methods have gained popularity in efficient parameter fine-tuning of models containing hundreds of billions of parameters. In this work, instead, we demonstrate the application of LoRA methods to train small-vision models in Federated Learning (FL) from scratch. We first propose an aggregation-agnostic method to integrate LoRA within FL, named FLoCoRA, showing that the method is capable of reducing communication costs by 4.8 times, while having less than 1% accuracy degradation, for a CIFAR-10 classification task with a ResNet-8. Next, we show that the same method can be extended with an affine quantization scheme, dividing the communication cost by 18.6 times, while comparing it with the standard method, with still less than 1% of accuracy loss, tested with on a ResNet-18 model. Our formulation represents a strong baseline for message size reduction, even when compared to conventional model compression works, while also reducing the training memory requirements due to the low-rank adaptation.
- [555] arXiv:2406.14085 [pdf, other]
-
Title: Teaching Models To Survive: Proper Scoring Rule and Stochastic Optimization with Competing RisksJulie Alberge (SODA), Vincent Maladière, Olivier Grisel, Judith Abécassis (SODA), Gaël Varoquaux (SODA)Subjects: Artificial Intelligence (cs.AI)
When data are right-censored, i.e. some outcomes are missing due to a limited period of observation, survival analysis can compute the "time to event". Multiple classes of outcomes lead to a classification variant: predicting the most likely event, known as competing risks, which has been less studied. To build a loss that estimates outcome probabilities for such settings, we introduce a strictly proper censoring-adjusted separable scoring rule that can be optimized on a subpart of the data because the evaluation is made independently of observations. It enables stochastic optimization for competing risks which we use to train gradient boosting trees. Compared to 11 state-of-the-art models, this model, MultiIncidence, performs best in estimating the probability of outcomes in survival and competing risks. It can predict at any time horizon and is much faster than existing alternatives.
- [556] arXiv:2406.14086 [pdf, other]
-
Title: Seg-LSTM: Performance of xLSTM for Semantic Segmentation of Remotely Sensed ImagesSubjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Recent advancements in autoregressive networks with linear complexity have driven significant research progress, demonstrating exceptional performance in large language models. A representative model is the Extended Long Short-Term Memory (xLSTM), which incorporates gating mechanisms and memory structures, performing comparably to Transformer architectures in long-sequence language tasks. Autoregressive networks such as xLSTM can utilize image serialization to extend their application to visual tasks such as classification and segmentation. Although existing studies have demonstrated Vision-LSTM's impressive results in image classification, its performance in image semantic segmentation remains unverified. Our study represents the first attempt to evaluate the effectiveness of Vision-LSTM in the semantic segmentation of remotely sensed images. This evaluation is based on a specifically designed encoder-decoder architecture named Seg-LSTM, and comparisons with state-of-the-art segmentation networks. Our study found that Vision-LSTM's performance in semantic segmentation was limited and generally inferior to Vision-Transformers-based and Vision-Mamba-based models in most comparative tests. Future research directions for enhancing Vision-LSTM are recommended. The source code is available from this https URL.
- [557] arXiv:2406.14087 [pdf, other]
-
Title: Semi Supervised Heterogeneous Domain Adaptation via Disentanglement and Pseudo-LabellingJournal-ref: ECML PKDD: Joint European Conference on Machine Learning and Knowledge Discovery in Databases, Sep 2024, Vilnius, LithuaniaSubjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Semi-supervised domain adaptation methods leverage information from a source labelled domain with the goal of generalizing over a scarcely labelled target domain. While this setting already poses challenges due to potential distribution shifts between domains, an even more complex scenario arises when source and target data differs in modality representation (e.g. they are acquired by sensors with different characteristics). For instance, in remote sensing, images may be collected via various acquisition modes (e.g. optical or radar), different spectral characteristics (e.g. RGB or multi-spectral) and spatial resolutions. Such a setting is denoted as Semi-Supervised Heterogeneous Domain Adaptation (SSHDA) and it exhibits an even more severe distribution shift due to modality heterogeneity across this http URL cope with the challenging SSHDA setting, here we introduce SHeDD (Semi-supervised Heterogeneous Domain Adaptation via Disentanglement) an end-to-end neural framework tailored to learning a target domain classifier by leveraging both labelled and unlabelled data from heterogeneous data sources. SHeDD is designed to effectively disentangle domain-invariant representations, relevant for the downstream task, from domain-specific information, that can hinder the cross-modality transfer. Additionally, SHeDD adopts an augmentation-based consistency regularization mechanism that takes advantages of reliable pseudo-labels on the unlabelled target samples to further boost its generalization ability on the target domain. Empirical evaluations on two remote sensing benchmarks, encompassing heterogeneous data in terms of acquisition modes and spectral/spatial resolutions, demonstrate the quality of SHeDD compared to both baseline and state-of-the-art competing approaches. Our code is publicly available here: this https URL
- [558] arXiv:2406.14088 [pdf, html, other]
-
Title: ReaLHF: Optimized RLHF Training for Large Language Models through Parameter ReallocationComments: 13 pages (15 pages with references), 13 figuresSubjects: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Reinforcement Learning from Human Feedback (RLHF) stands as a pivotal technique in empowering large language model (LLM) applications. Since RLHF involves diverse computational workloads and intricate dependencies among multiple LLMs, directly adopting parallelization techniques from supervised training can result in sub-optimal performance. To overcome this limitation, we propose a novel approach named parameter ReaLlocation, which dynamically redistributes LLM parameters in the cluster and adapts parallelization strategies during training. Building upon this idea, we introduce ReaLHF, a pioneering system capable of automatically discovering and running efficient execution plans for RLHF training given the desired algorithmic and hardware configurations. ReaLHF formulates the execution plan for RLHF as an augmented dataflow graph. Based on this formulation, ReaLHF employs a tailored search algorithm with a lightweight cost estimator to discover an efficient execution plan. Subsequently, the runtime engine deploys the selected plan by effectively parallelizing computations and redistributing parameters. We evaluate ReaLHF on the LLaMA-2 models with up to $4\times70$ billion parameters and 128 GPUs. The experiment results showcase ReaLHF's substantial speedups of $2.0-10.6\times$ compared to baselines. Furthermore, the execution plans generated by ReaLHF exhibit an average of $26\%$ performance improvement over heuristic approaches based on Megatron-LM. The source code of ReaLHF is publicly available at this https URL .
- [559] arXiv:2406.14090 [pdf, html, other]
-
Title: Personalized Music Recommendation with a Heterogeneity-aware Deep Bayesian NetworkComments: 34 pages, 19 figuresSubjects: Artificial Intelligence (cs.AI)
Music recommender systems are crucial in music streaming platforms, providing users with music they would enjoy. Recent studies have shown that user emotions can affect users' music mood preferences. However, existing emotion-aware music recommender systems (EMRSs) explicitly or implicitly assume that users' actual emotional states expressed by an identical emotion word are homogeneous. They also assume that users' music mood preferences are homogeneous under an identical emotional state. In this article, we propose four types of heterogeneity that an EMRS should consider: emotion heterogeneity across users, emotion heterogeneity within a user, music mood preference heterogeneity across users, and music mood preference heterogeneity within a user. We further propose a Heterogeneity-aware Deep Bayesian Network (HDBN) to model these assumptions. The HDBN mimics a user's decision process to choose music with four components: personalized prior user emotion distribution modeling, posterior user emotion distribution modeling, user grouping, and Bayesian neural network-based music mood preference prediction. We constructed a large-scale dataset called EmoMusicLJ to validate our method. Extensive experiments demonstrate that our method significantly outperforms baseline approaches on widely used HR and NDCG recommendation metrics. Ablation experiments and case studies further validate the effectiveness of our HDBN. The source code is available at this https URL.
- [560] arXiv:2406.14091 [pdf, html, other]
-
Title: Protecting Privacy Through Approximating Optimal Parameters for Sequence Unlearning in Language ModelsComments: Accepted to ACL2024 findingsSubjects: Computation and Language (cs.CL)
Although language models (LMs) demonstrate exceptional capabilities on various tasks, they are potentially vulnerable to extraction attacks, which represent a significant privacy risk. To mitigate the privacy concerns of LMs, machine unlearning has emerged as an important research area, which is utilized to induce the LM to selectively forget about some of its training data. While completely retraining the model will guarantee successful unlearning and privacy assurance, it is impractical for LMs, as it would be time-consuming and resource-intensive. Prior works efficiently unlearn the target token sequences, but upon subsequent iterations, the LM displays significant degradation in performance. In this work, we propose Privacy Protection via Optimal Parameters (POP), a novel unlearning method that effectively forgets the target token sequences from the pretrained LM by applying optimal gradient updates to the parameters. Inspired by the gradient derivation of complete retraining, we approximate the optimal training objective that successfully unlearns the target sequence while retaining the knowledge from the rest of the training data. Experimental results demonstrate that POP exhibits remarkable retention performance post-unlearning across 9 classification and 4 dialogue benchmarks, outperforming the state-of-the-art by a large margin. Furthermore, we introduce Remnant Memorization Accuracy that quantifies privacy risks based on token likelihood and validate its effectiveness through both qualitative and quantitative analyses.
- [561] arXiv:2406.14092 [pdf, html, other]
-
Title: Seamless Language Expansion: Enhancing Multilingual Mastery in Self-Supervised ModelsComments: Accepted by Interspeech 2024Subjects: Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
Self-supervised (SSL) models have shown great performance in various downstream tasks. However, they are typically developed for limited languages, and may encounter new languages in real-world. Developing a SSL model for each new language is costly. Thus, it is vital to figure out how to efficiently adapt existed SSL models to a new language without impairing its original abilities. We propose adaptation methods which integrate LoRA to existed SSL models to extend new language. We also develop preservation strategies which include data combination and re-clustering to retain abilities on existed languages. Applied to mHuBERT, we investigate their effectiveness on speech re-synthesis task. Experiments show that our adaptation methods enable mHuBERT to be applied to a new language (Mandarin) with MOS value increased about 1.6 and the relative value of WER reduced up to 61.72%. Also, our preservation strategies ensure that the performance on both existed and new languages remains intact.
- [562] arXiv:2406.14095 [pdf, html, other]
-
Title: Memory-Efficient Gradient Unrolling for Large-Scale Bi-level OptimizationQianli Shen, Yezhen Wang, Zhouhao Yang, Xiang Li, Haonan Wang, Yang Zhang, Jonathan Scarlett, Zhanxing Zhu, Kenji KawaguchiSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Bi-level optimization (BO) has become a fundamental mathematical framework for addressing hierarchical machine learning problems. As deep learning models continue to grow in size, the demand for scalable bi-level optimization solutions has become increasingly critical. Traditional gradient-based bi-level optimization algorithms, due to their inherent characteristics, are ill-suited to meet the demands of large-scale applications. In this paper, we introduce $\textbf{F}$orward $\textbf{G}$radient $\textbf{U}$nrolling with $\textbf{F}$orward $\textbf{F}$radient, abbreviated as $(\textbf{FG})^2\textbf{U}$, which achieves an unbiased stochastic approximation of the meta gradient for bi-level optimization. $(\text{FG})^2\text{U}$ circumvents the memory and approximation issues associated with classical bi-level optimization approaches, and delivers significantly more accurate gradient estimates than existing large-scale bi-level optimization approaches. Additionally, $(\text{FG})^2\text{U}$ is inherently designed to support parallel computing, enabling it to effectively leverage large-scale distributed computing systems to achieve significant computational efficiency. In practice, $(\text{FG})^2\text{U}$ and other methods can be strategically placed at different stages of the training process to achieve a more cost-effective two-phase paradigm. Further, $(\text{FG})^2\text{U}$ is easy to implement within popular deep learning frameworks, and can be conveniently adapted to address more challenging zeroth-order bi-level optimization scenarios. We provide a thorough convergence analysis and a comprehensive practical discussion for $(\text{FG})^2\text{U}$, complemented by extensive empirical evaluations, showcasing its superior performance in diverse large-scale bi-level optimization tasks.
- [563] arXiv:2406.14096 [pdf, html, other]
-
Title: Graph Neural Networks for Job Shop Scheduling Problems: A SurveyIgor G. Smit, Jianan Zhou, Robbert Reijnen, Yaoxin Wu, Jian Chen, Cong Zhang, Zaharah Bukhsh, Wim Nuijten, Yingqian ZhangSubjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Job shop scheduling problems (JSSPs) represent a critical and challenging class of combinatorial optimization problems. Recent years have witnessed a rapid increase in the application of graph neural networks (GNNs) to solve JSSPs, albeit lacking a systematic survey of the relevant literature. This paper aims to thoroughly review prevailing GNN methods for different types of JSSPs and the closely related flow-shop scheduling problems (FSPs), especially those leveraging deep reinforcement learning (DRL). We begin by presenting the graph representations of various JSSPs, followed by an introduction to the most commonly used GNN architectures. We then review current GNN-based methods for each problem type, highlighting key technical elements such as graph representations, GNN architectures, GNN tasks, and training algorithms. Finally, we summarize and analyze the advantages and limitations of GNNs in solving JSSPs and provide potential future research opportunities. We hope this survey can motivate and inspire innovative approaches for more powerful GNN-based approaches in tackling JSSPs and other scheduling problems.
- [564] arXiv:2406.14097 [pdf, other]
-
Title: Enhancing the LLM-Based Robot Manipulation Through Human-Robot CollaborationHaokun Liu, Yaonan Zhu, Kenji Kato, Atsushi Tsukahara, Izumi Kondo, Tadayoshi Aoyama, Yasuhisa HasegawaComments: IEEE Robotics and Automation LettersSubjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Large Language Models (LLMs) are gaining popularity in the field of robotics. However, LLM-based robots are limited to simple, repetitive motions due to the poor integration between language models, robots, and the environment. This paper proposes a novel approach to enhance the performance of LLM-based autonomous manipulation through Human-Robot Collaboration (HRC). The approach involves using a prompted GPT-4 language model to decompose high-level language commands into sequences of motions that can be executed by the robot. The system also employs a YOLO-based perception algorithm, providing visual cues to the LLM, which aids in planning feasible motions within the specific environment. Additionally, an HRC method is proposed by combining teleoperation and Dynamic Movement Primitives (DMP), allowing the LLM-based robot to learn from human guidance. Real-world experiments have been conducted using the Toyota Human Support Robot for manipulation tasks. The outcomes indicate that tasks requiring complex trajectory planning and reasoning over environments can be efficiently accomplished through the incorporation of human demonstrations.
- [565] arXiv:2406.14098 [pdf, html, other]
-
Title: HeartBeat: Towards Controllable Echocardiography Video Synthesis with Multimodal Conditions-Guided Diffusion ModelsComments: Accepted by MICCAI 2024Subjects: Computer Vision and Pattern Recognition (cs.CV)
Echocardiography (ECHO) video is widely used for cardiac examination. In clinical, this procedure heavily relies on operator experience, which needs years of training and maybe the assistance of deep learning-based systems for enhanced accuracy and efficiency. However, it is challenging since acquiring sufficient customized data (e.g., abnormal cases) for novice training and deep model development is clinically unrealistic. Hence, controllable ECHO video synthesis is highly desirable. In this paper, we propose a novel diffusion-based framework named HeartBeat towards controllable and high-fidelity ECHO video synthesis. Our highlight is three-fold. First, HeartBeat serves as a unified framework that enables perceiving multimodal conditions simultaneously to guide controllable generation. Second, we factorize the multimodal conditions into local and global ones, with two insertion strategies separately provided fine- and coarse-grained controls in a composable and flexible manner. In this way, users can synthesize ECHO videos that conform to their mental imagery by combining multimodal control signals. Third, we propose to decouple the visual concepts and temporal dynamics learning using a two-stage training scheme for simplifying the model training. One more interesting thing is that HeartBeat can easily generalize to mask-guided cardiac MRI synthesis in a few shots, showcasing its scalability to broader applications. Extensive experiments on two public datasets show the efficacy of the proposed HeartBeat.
- [566] arXiv:2406.14099 [pdf, html, other]
-
Title: Let Guidelines Guide You: A Prescriptive Guideline-Centered Data Annotation MethodologyFederico Ruggeri, Eleonora Misino, Arianna Muti, Katerina Korre, Paolo Torroni, Alberto Barrón-CedeñoSubjects: Computation and Language (cs.CL)
We introduce the Guideline-Centered annotation process, a novel data annotation methodology focused on reporting the annotation guidelines associated with each data sample. We identify three main limitations of the standard prescriptive annotation process and describe how the Guideline-Centered methodology overcomes them by reducing the loss of information in the annotation process and ensuring adherence to guidelines. Additionally, we discuss how the Guideline-Centered enables the reuse of annotated data across multiple tasks at the cost of a single human-annotation process.
- [567] arXiv:2406.14102 [pdf, html, other]
-
Title: SeCTIS: A Framework to Secure CTI SharingDincy R. Arikkat, Mert Cihangiroglu, Mauro Conti, Rafidha Rehiman K. A., Serena Nicolazzo, Antonino Nocera, Vinod PSubjects: Cryptography and Security (cs.CR)
The rise of IT-dependent operations in modern organizations has heightened their vulnerability to cyberattacks. As a growing number of organizations include smart, interconnected devices in their systems to automate their processes, the attack surface becomes much bigger, and the complexity and frequency of attacks pose a significant threat. Consequently, organizations have been compelled to seek innovative approaches to mitigate the menaces inherent in their infrastructure. In response, considerable research efforts have been directed towards creating effective solutions for sharing Cyber Threat Intelligence (CTI). Current information-sharing methods lack privacy safeguards, leaving organizations vulnerable to leaks of both proprietary and confidential data. To tackle this problem, we designed a novel framework called SeCTIS (Secure Cyber Threat Intelligence Sharing), integrating Swarm Learning and Blockchain technologies to enable businesses to collaborate, preserving the privacy of their CTI data. Moreover, our approach provides a way to assess the data and model quality, and the trustworthiness of all the participants leveraging some validators through Zero Knowledge Proofs. An extensive experimental campaign demonstrates our framework's correctness and performance, and the detailed attack model discusses its robustness against attacks in the context of data and model quality.
- [568] arXiv:2406.14103 [pdf, html, other]
-
Title: Two-Stage Depth Enhanced Learning with Obstacle Map For Object NavigationSubjects: Artificial Intelligence (cs.AI)
The task that requires an agent to navigate to a given object through only visual observation is called visual object navigation (VON). The main bottlenecks of VON are strategies exploration and prior knowledge exploitation. Traditional strategies exploration ignores the differences of searching and navigating stages, using the same reward in two stages, which reduces navigation performance and training efficiency. Our study enables the agent to explore larger area in searching stage and seek the optimal path in navigating stage, improving the success rate of navigation. Traditional prior knowledge exploitation focused on learning and utilizing object association, which ignored the depth and obstacle information in the environment. This paper uses the RGB and depth information of the training scene to pretrain the feature extractor, which improves navigation efficiency. The obstacle information is memorized by the agent during the navigation, reducing the probability of collision and deadlock. Depth, obstacle and other prior knowledge are concatenated and input into the policy network, and navigation actions are output under the training of two-stage rewards. We evaluated our method on AI2-Thor and RoboTHOR and demonstrated that it significantly outperforms state-of-the-art (SOTA) methods on success rate and navigation efficiency.
- [569] arXiv:2406.14106 [pdf, html, other]
-
Title: EasyECR: A Library for Easy Implementation and Evaluation of Event Coreference Resolution ModelsComments: 14 pages, 4 figures, 12 tablesSubjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Event Coreference Resolution (ECR) is the task of clustering event mentions that refer to the same real-world event. Despite significant advancements, ECR research faces two main challenges: limited generalizability across domains due to narrow dataset evaluations, and difficulties in comparing models within diverse ECR pipelines. To address these issues, we develop EasyECR, the first open-source library designed to standardize data structures and abstract ECR pipelines for easy implementation and fair evaluation. More specifically, EasyECR integrates seven representative pipelines and ten popular benchmark datasets, enabling model evaluations on various datasets and promoting the development of robust ECR pipelines. By conducting extensive evaluation via our EasyECR, we find that, \lowercase\expandafter{\romannumeral1}) the representative ECR pipelines cannot generalize across multiple datasets, hence evaluating ECR pipelines on multiple datasets is necessary, \lowercase\expandafter{\romannumeral2}) all models in ECR pipelines have a great effect on pipeline performance, therefore, when one model in ECR pipelines are compared, it is essential to ensure that the other models remain consistent. Additionally, reproducing ECR results is not trivial, and the developed library can help reduce this discrepancy. The experimental results provide valuable baselines for future research.
- [570] arXiv:2406.14111 [pdf, html, other]
-
Title: Expander Hierarchies for Normalized Cuts on GraphsComments: Accepted to KDD'24, August 25-29, 2024, Barcelona, SpainSubjects: Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG)
Expander decompositions of graphs have significantly advanced the understanding of many classical graph problems and led to numerous fundamental theoretical results. However, their adoption in practice has been hindered due to their inherent intricacies and large hidden factors in their asymptotic running times. Here, we introduce the first practically efficient algorithm for computing expander decompositions and their hierarchies and demonstrate its effectiveness and utility by incorporating it as the core component in a novel solver for the normalized cut graph clustering objective.
Our extensive experiments on a variety of large graphs show that our expander-based algorithm outperforms state-of-the-art solvers for normalized cut with respect to solution quality by a large margin on a variety of graph classes such as citation, e-mail, and social networks or web graphs while remaining competitive in running time. - [571] arXiv:2406.14114 [pdf, html, other]
-
Title: Dye4AI: Assuring Data Boundary on Generative AI ServicesSubjects: Cryptography and Security (cs.CR)
Generative artificial intelligence (AI) is versatile for various applications, but security and privacy concerns with third-party AI vendors hinder its broader adoption in sensitive scenarios. Hence, it is essential for users to validate the AI trustworthiness and ensure the security of data boundaries. In this paper, we present a dye testing system named Dye4AI, which injects crafted trigger data into human-AI dialogue and observes AI responses towards specific prompts to diagnose data flow in AI model evolution. Our dye testing procedure contains 3 stages: trigger generation, trigger insertion, and trigger retrieval. First, to retain both uniqueness and stealthiness, we design a new trigger that transforms a pseudo-random number to a intelligible format. Second, with a custom-designed three-step conversation strategy, we insert each trigger item into dialogue and confirm the model memorizes the new trigger knowledge in the current session. Finally, we routinely try to recover triggers with specific prompts in new sessions, as triggers can present in new sessions only if AI vendors leverage user data for model fine-tuning. Extensive experiments on six LLMs demonstrate our dye testing scheme is effective in ensuring the data boundary, even for models with various architectures and parameter sizes. Also, larger and premier models tend to be more suitable for Dye4AI, e.g., trigger can be retrieved in OpenLLaMa-13B even with only 2 insertions per trigger item. Moreover, we analyze the prompt selection in dye testing, providing insights for future testing systems on generative AI services.
- [572] arXiv:2406.14115 [pdf, html, other]
-
Title: Take the essence and discard the dross: A Rethinking on Data Selection for Fine-Tuning Large Language ModelsSubjects: Computation and Language (cs.CL)
Data selection for fine-tuning Large Language Models (LLMs) aims to select a high-quality subset from a given candidate dataset to train a Pending Fine-tune Model (PFM) into a Selective-Enhanced Model (SEM). It can improve the model performance and accelerate the training process. Although a few surveys have investigated related works of data selection, there is a lack of comprehensive comparison between existing methods due to their various experimental settings. To address this issue, we first propose a three-stage scheme for data selection and comprehensively review existing works according to this scheme. Then, we design a unified comparing method with ratio-based efficiency indicators and ranking-based feasibility indicators to overcome the difficulty of comparing various models with diverse experimental settings. After an in-depth comparative analysis, we find that the more targeted method with data-specific and model-specific quality labels has higher efficiency, but the introduction of additional noise information should be avoided when designing selection algorithms. Finally, we summarize the trends in data selection and highlight the short-term and long-term challenges to guide future research.
- [573] arXiv:2406.14117 [pdf, html, other]
-
Title: An Investigation of Prompt Variations for Zero-shot LLM-based RankersSubjects: Information Retrieval (cs.IR); Computation and Language (cs.CL)
We provide a systematic understanding of the impact of specific components and wordings used in prompts on the effectiveness of rankers based on zero-shot Large Language Models (LLMs). Several zero-shot ranking methods based on LLMs have recently been proposed. Among many aspects, methods differ across (1) the ranking algorithm they implement, e.g., pointwise vs. listwise, (2) the backbone LLMs used, e.g., GPT3.5 vs. FLAN-T5, (3) the components and wording used in prompts, e.g., the use or not of role-definition (role-playing) and the actual words used to express this. It is currently unclear whether performance differences are due to the underlying ranking algorithm, or because of spurious factors such as better choice of words used in prompts. This confusion risks to undermine future research. Through our large-scale experimentation and analysis, we find that ranking algorithms do contribute to differences between methods for zero-shot LLM ranking. However, so do the LLM backbones -- but even more importantly, the choice of prompt components and wordings affect the ranking. In fact, in our experiments, we find that, at times, these latter elements have more impact on the ranker's effectiveness than the actual ranking algorithms, and that differences among ranking methods become more blurred when prompt variations are considered.
- [574] arXiv:2406.14119 [pdf, html, other]
-
Title: Entropy stable hydrostatic reconstruction schemes for shallow water systemsComments: 28 pages, 10 figures, submitted to Journal of Computational PhysicsSubjects: Numerical Analysis (math.NA)
In this work, we develop a new hydrostatic reconstruction procedure to construct well-balanced schemes for one and multilayer shallow water flows, including wetting and drying. Initially, we derive the method for a path-conservative finite volume scheme and combine it with entropy conservative fluxes and suitable numerical dissipation to preserve an entropy inequality in the semi-discrete case. We then combine the novel hydrostatic reconstruction with a collocated nodal split-form discontinuous Galerkin spectral element method, extending the method to high-order and curvilinear meshes. The high-order method incorporates an additional positivity-limiter and is blended with a compatible subcell finite volume method to maintain well-balancedness at wet/dry fronts. We prove entropy stability, well-balancedness, and positivity-preservation for both methods. Numerical results for the high-order method validate the theoretical findings and demonstrate the robustness of the scheme.
- [575] arXiv:2406.14120 [pdf, html, other]
-
Title: Boosting Hyperspectral Image Classification with Gate-Shift-Fuse Mechanisms in a Novel CNN-Transformer ApproachSubjects: Computer Vision and Pattern Recognition (cs.CV)
During the process of classifying Hyperspectral Image (HSI), every pixel sample is categorized under a land-cover type. CNN-based techniques for HSI classification have notably advanced the field by their adept feature representation capabilities. However, acquiring deep features remains a challenge for these CNN-based methods. In contrast, transformer models are adept at extracting high-level semantic features, offering a complementary strength. This paper's main contribution is the introduction of an HSI classification model that includes two convolutional blocks, a Gate-Shift-Fuse (GSF) block and a transformer block. This model leverages the strengths of CNNs in local feature extraction and transformers in long-range context modelling. The GSF block is designed to strengthen the extraction of local and global spatial-spectral features. An effective attention mechanism module is also proposed to enhance the extraction of information from HSI cubes. The proposed method is evaluated on four well-known datasets (the Indian Pines, Pavia University, WHU-WHU-Hi-LongKou and WHU-Hi-HanChuan), demonstrating that the proposed framework achieves superior results compared to other models.
- [576] arXiv:2406.14122 [pdf, html, other]
-
Title: EduQate: Generating Adaptive Curricula through RMABs in Education SettingsComments: 9 pages, NeurIPS 2024 pre-printSubjects: Artificial Intelligence (cs.AI)
There has been significant interest in the development of personalized and adaptive educational tools that cater to a student's individual learning progress. A crucial aspect in developing such tools is in exploring how mastery can be achieved across a diverse yet related range of content in an efficient manner. While Reinforcement Learning and Multi-armed Bandits have shown promise in educational settings, existing works often assume the independence of learning content, neglecting the prevalent interdependencies between such content. In response, we introduce Education Network Restless Multi-armed Bandits (EdNetRMABs), utilizing a network to represent the relationships between interdependent arms. Subsequently, we propose EduQate, a method employing interdependency-aware Q-learning to make informed decisions on arm selection at each time step. We establish the optimality guarantee of EduQate and demonstrate its efficacy compared to baseline policies, using students modeled from both synthetic and real-world data.
- [577] arXiv:2406.14123 [pdf, other]
-
Title: Mapping AI Ethics Narratives: Evidence from Twitter Discourse Between 2015 and 2022Comments: 22 pages, 6 figuresSubjects: Computers and Society (cs.CY)
Public participation is indispensable for an insightful understanding of the ethics issues raised by AI technologies. Twitter is selected in this paper to serve as an online public sphere for exploring discourse on AI ethics, facilitating broad and equitable public engagement in the development of AI technology. A research framework is proposed to demonstrate how to transform AI ethics-related discourse on Twitter into coherent and readable narratives. It consists of two parts: 1) combining neural networks with large language models to construct a topic hierarchy that contains popular topics of public concern without ignoring small but important voices, thus allowing a fine-grained exploration of meaningful information. 2) transforming fragmented and difficult-to-understand social media information into coherent and easy-to-read stories through narrative visualization, providing a new perspective for understanding the information in Twitter data. This paper aims to advocate for policy makers to enhance public oversight of AI technologies so as to promote their fair and sustainable development.
- [578] arXiv:2406.14124 [pdf, html, other]
-
Title: Measuring Sample Importance in Data Pruning for Training LLMs from a Data Compression PerspectiveSubjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Compute-efficient training of large language models (LLMs) has become an important research problem. In this work, we consider data pruning as a method of data-efficient training of LLMs, where we take a data compression view on data pruning. We argue that the amount of information of a sample, or the achievable compression on its description length, represents its sample importance. The key idea is that, less informative samples are likely to contain redundant information, and thus should be pruned first. We leverage log-likelihood function of trained models as a surrogate to measure information content of samples. Experiments reveal a surprising insight that information-based pruning can enhance the generalization capability of the model, improves upon language modeling and downstream tasks as compared to the model trained on the entire dataset.
- [579] arXiv:2406.14125 [pdf, html, other]
-
Title: On Unique Error Patterns in the Levenshtein's Sequence Reconstruction ModelSubjects: Information Theory (cs.IT)
In the Levenshtein's sequence reconstruction problem a codeword is transmitted through $N$ channels and in each channel a set of errors is introduced to the transmitted word. In previous works, the restriction that each channel provides a unique output word has been essential. In this work, we assume only that each channel introduces a unique set of errors to the transmitted word and hence some output words can also be identical. As we will discuss, this interpretation is both natural and useful for deletion and insertion errors. We give properties, techniques and (optimal) results for this situation.
Quaternary alphabets are relevant due to applications related to DNA-memories. Hence, we introduce an efficient Las Vegas style decoding algorithm for simultaneous insertion, deletion and substitution errors in $q$-ary Hamming spaces for $q \geq 4$. - [580] arXiv:2406.14129 [pdf, html, other]
-
Title: Towards Event-oriented Long Video UnderstandingYifan Du, Kun Zhou, Yuqi Huo, Yifan Li, Wayne Xin Zhao, Haoyu Lu, Zijia Zhao, Bingning Wang, Weipeng Chen, Ji-Rong WenComments: Work on progressSubjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Multimedia (cs.MM)
With the rapid development of video Multimodal Large Language Models (MLLMs), numerous benchmarks have been proposed to assess their video understanding capability. However, due to the lack of rich events in the videos, these datasets may suffer from the short-cut bias that the answers can be deduced from a few frames, without the need to watch the entire video. To address this issue, we introduce Event-Bench, an event-oriented long video understanding benchmark built on existing datasets and human annotations. Event-Bench includes six event-related tasks and 2,190 test instances to comprehensively evaluate video event understanding ability. Additionally, we propose Video Instruction Merging~(VIM), a cost-effective method that enhances video MLLMs using merged, event-intensive video instructions, addressing the scarcity of human-annotated, event-intensive data. Extensive experiments show that the best-performing model, GPT-4o, achieves an overall accuracy of 53.33, significantly outperforming the best open-source model by 41.42%. Leveraging an effective instruction synthesis method and an adaptive model architecture, VIM surpasses both state-of-the-art open-source models and GPT-4V on the Event-Bench. All code, data, and models are publicly available at this https URL.
- [581] arXiv:2406.14130 [pdf, html, other]
-
Title: ExVideo: Extending Video Diffusion Models via Parameter-Efficient Post-TuningComments: 8 pages, 5 figuresSubjects: Computer Vision and Pattern Recognition (cs.CV)
Recently, advancements in video synthesis have attracted significant attention. Video synthesis models such as AnimateDiff and Stable Video Diffusion have demonstrated the practical applicability of diffusion models in creating dynamic visual content. The emergence of SORA has further spotlighted the potential of video generation technologies. Nonetheless, the extension of video lengths has been constrained by the limitations in computational resources. Most existing video synthesis models can only generate short video clips. In this paper, we propose a novel post-tuning methodology for video synthesis models, called ExVideo. This approach is designed to enhance the capability of current video synthesis models, allowing them to produce content over extended temporal durations while incurring lower training expenditures. In particular, we design extension strategies across common temporal model architectures respectively, including 3D convolution, temporal attention, and positional embedding. To evaluate the efficacy of our proposed post-tuning approach, we conduct extension training on the Stable Video Diffusion model. Our approach augments the model's capacity to generate up to $5\times$ its original number of frames, requiring only 1.5k GPU hours of training on a dataset comprising 40k videos. Importantly, the substantial increase in video length doesn't compromise the model's innate generalization capabilities, and the model showcases its advantages in generating videos of diverse styles and resolutions. We will release the source code and the enhanced model publicly.
- [582] arXiv:2406.14131 [pdf, html, other]
-
Title: Detecting sexually explicit content in the context of the child sexual abuse materials (CSAM): end-to-end classifiers and region-based networksComments: Accepted at ECML PKDD 2023 - SoGOOD WorkshopSubjects: Computer Vision and Pattern Recognition (cs.CV); Emerging Technologies (cs.ET)
Child sexual abuse materials (CSAM) pose a significant threat to the safety and well-being of children worldwide. Detecting and preventing the distribution of such materials is a critical task for law enforcement agencies and technology companies. As content moderation is often manual, developing an automated detection system can help reduce human reviewers' exposure to potentially harmful images and accelerate the process of counteracting. This study presents methods for classifying sexually explicit content, which plays a crucial role in the automated CSAM detection system. Several approaches are explored to solve the task: an end-to-end classifier, a classifier with person detection and a private body parts detector. All proposed methods are tested on the images obtained from the online tool for reporting illicit content. Due to legal constraints, access to the data is limited, and all algorithms are executed remotely on the isolated server. The end-to-end classifier yields the most promising results, with an accuracy of 90.17%, after augmenting the training set with the additional neutral samples and adult pornography. While detection-based methods may not achieve higher accuracy rates and cannot serve as a final classifier on their own, their inclusion in the system can be beneficial. Human body-oriented approaches generate results that are easier to interpret, and obtaining more interpretable results is essential when analyzing models that are trained without direct access to data.
- [583] arXiv:2406.14132 [pdf, html, other]
-
Title: Enhancing Monotonic Modeling with Spatio-Temporal Adaptive Awareness in Diverse MarketingComments: 7 pagesSubjects: Artificial Intelligence (cs.AI)
In the mobile internet era, the Online Food Ordering Service (OFOS) emerges as an integral component of inclusive finance owing to the convenience it brings to people. OFOS platforms offer dynamic allocation incentives to users and merchants through diverse marketing campaigns to encourage payments while maintaining the platforms' budget efficiency. Despite significant progress, the marketing domain continues to face two primary challenges: (i) how to allocate a limited budget with greater efficiency, demanding precision in predicting users' monotonic response (i.e. sensitivity) to incentives, and (ii) ensuring spatio-temporal adaptability and robustness in diverse marketing campaigns across different times and locations. To address these issues, we propose a Constrained Monotonic Adaptive Network (CoMAN) method for spatio-temporal perception within marketing pricing. Specifically, we capture spatio-temporal preferences within attribute features through two foundational spatio-temporal perception modules. To further enhance catching the user sensitivity differentials to incentives across varied times and locations, we design modules for learning spatio-temporal convexity and concavity as well as for expressing sensitivity functions. CoMAN can achieve a more efficient allocation of incentive investments during pricing, thus increasing the conversion rate and orders while maintaining budget efficiency. Extensive offline and online experimental results within our diverse marketing campaigns demonstrate the effectiveness of the proposed approach while outperforming the monotonic state-of-the-art method.
- [584] arXiv:2406.14135 [pdf, html, other]
-
Title: Autonomous Robotic Drilling System for Mice Cranial Window CreationComments: 15 pages, 14 figures, to be submitted to IEEESubjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Robotic assistance for experimental manipulation in the life sciences is expected to enable favorable outcomes, regardless of the skill of the scientist. Experimental specimens in the life sciences are subject to individual variability hence require intricate algorithms for successful autonomous robotic control. As a use case, we are studying the creation of cranial windows in mice. This operation requires the removal of an 8-mm-circular patch of the skull, which is approximately 300 um thick, but the shape and thickness of the mouse skull significantly varies depending on the strain of mouse, sex, and age. In this work, we propose an autonomous robotic drilling method with no offline planning, consisting of a trajectory planning block with execution-time feedback with completion level recognition based on image and force information. The force information allows for completion-level resolution to increase 10 fold. We evaluate the proposed method in two ways. First, in an eggshell drilling task and achieved a success rate of 95% and average drilling time of 7.1 min out of 20 trials. Second, in postmortem mice and with a success rate of 70% and average drilling time of 9.3 min out of 20 trials.
- [585] arXiv:2406.14136 [pdf, html, other]
-
Title: One Fling to Goal: Environment-aware Dynamics for Goal-conditioned Fabric FlingingSubjects: Robotics (cs.RO)
Fabric manipulation dynamically is commonly seen in manufacturing and domestic settings. While dynamically manipulating a fabric piece to reach a target state is highly efficient, this task presents considerable challenges due to the varying properties of different fabrics, complex dynamics when interacting with environments, and meeting required goal conditions. To address these challenges, we present \textit{One Fling to Goal}, an algorithm capable of handling fabric pieces with diverse shapes and physical properties across various scenarios. Our method learns a graph-based dynamics model equipped with environmental awareness. With this dynamics model, we devise a real-time controller to enable high-speed fabric manipulation in one attempt, requiring less than 3 seconds to finish the goal-conditioned task. We experimentally validate our method on a goal-conditioned manipulation task in five diverse scenarios. Our method significantly improves this goal-conditioned task, achieving an average error of 13.2mm in complex scenarios. Our method can be seamlessly transferred to real-world robotic systems and generalized to unseen scenarios in a zero-shot manner.
- [586] arXiv:2406.14137 [pdf, html, other]
-
Title: MACAROON: Training Vision-Language Models To Be Your Engaged PartnersComments: The code will be made public at this https URLSubjects: Computation and Language (cs.CL)
Large vision-language models (LVLMs), while proficient in following instructions and responding to diverse questions, invariably generate detailed responses even when questions are ambiguous or unanswerable, leading to hallucinations and bias issues. Thus, it is essential for LVLMs to proactively engage with humans to ask for clarifications or additional information for better responses. In this study, we aim to shift LVLMs from passive answer providers to proactive engaged partners. We begin by establishing a three-tiered hierarchy for questions of invalid, ambiguous, and personalizable nature to measure the proactive engagement capabilities of LVLMs. Utilizing this hierarchy, we create PIE, (ProactIve Engagement Evaluation) through GPT-4o and human annotators, consisting of 853 questions across six distinct, fine-grained question types that are verified by human annotators and accompanied with well-defined metrics. Our evaluations on \benchmark indicate poor performance of existing LVLMs, with the best-performing open-weights model only achieving an Aggregate Align Rate (AAR) of 0.28. In response, we introduce MACAROON, self-iMaginAtion for ContrAstive pReference OptimizatiON, which instructs LVLMs to autonomously generate contrastive response pairs for unlabeled questions given the task description and human-crafted criteria. Then, the self-imagined data is formatted for conditional reinforcement learning. Experimental results show MACAROON effectively improves LVLMs' capabilities to be proactively engaged (0.84 AAR) while maintaining comparable performance on general tasks.
- [587] arXiv:2406.14139 [pdf, html, other]
-
Title: Comparing the Effects of Visual, Haptic, and Visuohaptic Encoding on Memory Retention of Digital Objects in Virtual RealityLucas Siqueira Rodrigues, Timo Torsten Schmidt, John Nyakatura, Stefan Zachow, Johann Habakuk Israel, Thomas KoschComments: Accepted at NordiCHI 2024Subjects: Human-Computer Interaction (cs.HC)
Although Virtual Reality (VR) has undoubtedly improved human interaction with 3D data, users still face difficulties retaining important details of complex digital objects in preparation for physical tasks. To address this issue, we evaluated the potential of visuohaptic integration to improve the memorability of virtual objects in immersive visualizations. In a user study (N=20), participants performed a delayed match-to-sample task where they memorized stimuli of visual, haptic, or visuohaptic encoding conditions. We assessed performance differences between the conditions through error rates and response time. We found that visuohaptic encoding significantly improved memorization accuracy compared to unimodal visual and haptic conditions. Our analysis indicates that integrating haptics into immersive visualizations enhances the memorability of digital objects. We discuss its implications for the optimal encoding design in VR applications that assist professionals who need to memorize and recall virtual objects in their daily work.
- [588] arXiv:2406.14141 [pdf, html, other]
-
Title: Online Learning of Weakly Coupled MDP Policies for Load Balancing and Auto ScalingSubjects: Systems and Control (eess.SY); Artificial Intelligence (cs.AI); Networking and Internet Architecture (cs.NI)
Load balancing and auto scaling are at the core of scalable, contemporary systems, addressing dynamic resource allocation and service rate adjustments in response to workload changes. This paper introduces a novel model and algorithms for tuning load balancers coupled with auto scalers, considering bursty traffic arriving at finite queues. We begin by presenting the problem as a weakly coupled Markov Decision Processes (MDP), solvable via a linear program (LP). However, as the number of control variables of such LP grows combinatorially, we introduce a more tractable relaxed LP formulation, and extend it to tackle the problem of online parameter learning and policy optimization using a two-timescale algorithm based on the LP Lagrangian.
- [589] arXiv:2406.14143 [pdf, html, other]
-
Title: Using the Transport of Intensity and the Transport of Phase Equation for Phase RetrievalSubjects: Numerical Analysis (math.NA)
We investigate the transport of intensity equation (TIE) and the transport of phase equation (TPE) for solving the phase retrieval problem. Both the TIE and the TPE are derived from the paraxial Helmholtz equation and relate phase information to the intensity. The TIE is usually favored since the TPE is nonlinear. The main contribution of this paper is that we discuss situations in which it is possible to use the two equations in a hybrid manner: We show that 2-dimensional phase information retrieved by the TIE can be used as initial data for the TPE, enabling the acquisition of 3-dimensional phase information. The latter is solved using the method of characteristic and viscosity methods. Both the TIE and the viscosity method are numerically implemented with finite element methods.
- [590] arXiv:2406.14144 [pdf, html, other]
-
Title: Finding Safety Neurons in Large Language ModelsSubjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Large language models (LLMs) excel in various capabilities but also pose safety risks such as generating harmful content and misinformation, even after safety alignment. In this paper, we explore the inner mechanisms of safety alignment from the perspective of mechanistic interpretability, focusing on identifying and analyzing safety neurons within LLMs that are responsible for safety behaviors. We propose generation-time activation contrasting to locate these neurons and dynamic activation patching to evaluate their causal effects. Experiments on multiple recent LLMs show that: (1) Safety neurons are sparse and effective. We can restore $90$% safety performance with intervention only on about $5$% of all the neurons. (2) Safety neurons encode transferrable mechanisms. They exhibit consistent effectiveness on different red-teaming datasets. The finding of safety neurons also interprets "alignment tax". We observe that the identified key neurons for safety and helpfulness significantly overlap, but they require different activation patterns of the shared neurons. Furthermore, we demonstrate an application of safety neurons in detecting unsafe outputs before generation. Our findings may promote further research on understanding LLM alignment. The source codes will be publicly released to facilitate future research.
- [591] arXiv:2406.14150 [pdf, html, other]
-
Title: Multi-modal Transfer Learning between Biological Foundation ModelsJuan Jose Garau-Luis, Patrick Bordes, Liam Gonzalez, Masa Roller, Bernardo P. de Almeida, Lorenz Hexemer, Christopher Blum, Stefan Laurent, Jan Grzegorzewski, Maren Lang, Thomas Pierrot, Guillaume RichardSubjects: Machine Learning (cs.LG)
Biological sequences encode fundamental instructions for the building blocks of life, in the form of DNA, RNA, and proteins. Modeling these sequences is key to understand disease mechanisms and is an active research area in computational biology. Recently, Large Language Models have shown great promise in solving certain biological tasks but current approaches are limited to a single sequence modality (DNA, RNA, or protein). Key problems in genomics intrinsically involve multiple modalities, but it remains unclear how to adapt general-purpose sequence models to those cases. In this work we propose a multi-modal model that connects DNA, RNA, and proteins by leveraging information from different pre-trained modality-specific encoders. We demonstrate its capabilities by applying it to the largely unsolved problem of predicting how multiple RNA transcript isoforms originate from the same gene (i.e. same DNA sequence) and map to different transcription expression levels across various human tissues. We show that our model, dubbed IsoFormer, is able to accurately predict differential transcript expression, outperforming existing methods and leveraging the use of multiple modalities. Our framework also achieves efficient transfer knowledge from the encoders pre-training as well as in between modalities. We open-source our model, paving the way for new multi-modal gene expression approaches.
- [592] arXiv:2406.14154 [pdf, html, other]
-
Title: Watching the Watchers: A Comparative Fairness Audit of Cloud-based Content Moderation ServicesComments: Accepted at European Workshop on Algorithmic Fairness (EWAF'24)Subjects: Computers and Society (cs.CY); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Online platforms face the challenge of moderating an ever-increasing volume of content, including harmful hate speech. In the absence of clear legal definitions and a lack of transparency regarding the role of algorithms in shaping decisions on content moderation, there is a critical need for external accountability. Our study contributes to filling this gap by systematically evaluating four leading cloud-based content moderation services through a third-party audit, highlighting issues such as biases against minorities and vulnerable groups that may arise through over-reliance on these services. Using a black-box audit approach and four benchmark data sets, we measure performance in explicit and implicit hate speech detection as well as counterfactual fairness through perturbation sensitivity analysis and present disparities in performance for certain target identity groups and data sets. Our analysis reveals that all services had difficulties detecting implicit hate speech, which relies on more subtle and codified messages. Moreover, our results point to the need to remove group-specific bias. It seems that biases towards some groups, such as Women, have been mostly rectified, while biases towards other groups, such as LGBTQ+ and PoC remain.
- [593] arXiv:2406.14155 [pdf, html, other]
-
Title: Aligning Large Language Models with Diverse Political ViewpointsSubjects: Computation and Language (cs.CL)
Large language models such as ChatGPT often exhibit striking political biases. If users query them about political information, they might take a normative stance and reinforce such biases. To overcome this, we align LLMs with diverse political viewpoints from 100,000 comments written by candidates running for national parliament in Switzerland. Such aligned models are able to generate more accurate political viewpoints from Swiss parties compared to commercial models such as ChatGPT. We also propose a procedure to generate balanced overviews from multiple viewpoints using such models.
- [594] arXiv:2406.14156 [pdf, html, other]
-
Title: Tractable Equilibrium Computation in Markov Games through Risk AversionComments: preprint of multi-agent RL with risk-averse equilibriaSubjects: Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
A significant roadblock to the development of principled multi-agent reinforcement learning is the fact that desired solution concepts like Nash equilibria may be intractable to compute. To overcome this obstacle, we take inspiration from behavioral economics and show that -- by imbuing agents with important features of human decision-making like risk aversion and bounded rationality -- a class of risk-averse quantal response equilibria (RQE) become tractable to compute in all $n$-player matrix and finite-horizon Markov games. In particular, we show that they emerge as the endpoint of no-regret learning in suitably adjusted versions of the games. Crucially, the class of computationally tractable RQE is independent of the underlying game structure and only depends on agents' degree of risk-aversion and bounded rationality. To validate the richness of this class of solution concepts we show that it captures peoples' patterns of play in a number of 2-player matrix games previously studied in experimental economics. Furthermore, we give a first analysis of the sample complexity of computing these equilibria in finite-horizon Markov games when one has access to a generative model and validate our findings on a simple multi-agent reinforcement learning benchmark.
- [595] arXiv:2406.14161 [pdf, other]
-
Title: Iterative Sizing Field Prediction for Adaptive Mesh Generation From Expert DemonstrationsNiklas Freymuth, Philipp Dahlinger, Tobias Würth, Philipp Becker, Aleksandar Taranovic, Onno Grönheim, Luise Kärger, Gerhard NeumannComments: Accepted as a workshop paper in AI4Science@ICML 2024Subjects: Machine Learning (cs.LG)
Many engineering systems require accurate simulations of complex physical systems. Yet, analytical solutions are only available for simple problems, necessitating numerical approximations such as the Finite Element Method (FEM). The cost and accuracy of the FEM scale with the resolution of the underlying computational mesh. To balance computational speed and accuracy meshes with adaptive resolution are used, allocating more resources to critical parts of the geometry. Currently, practitioners often resort to hand-crafted meshes, which require extensive expert knowledge and are thus costly to obtain. Our approach, Adaptive Meshing By Expert Reconstruction (AMBER), views mesh generation as an imitation learning problem. AMBER combines a graph neural network with an online data acquisition scheme to predict the projected sizing field of an expert mesh on a given intermediate mesh, creating a more accurate subsequent mesh. This iterative process ensures efficient and accurate imitation of expert mesh resolutions on arbitrary new geometries during inference. We experimentally validate AMBER on heuristic 2D meshes and 3D meshes provided by a human expert, closely matching the provided demonstrations and outperforming a single-step CNN baseline.
- [596] arXiv:2406.14162 [pdf, html, other]
-
Title: DIRAS: Efficient LLM-Assisted Annotation of Document Relevance in Retrieval Augmented GenerationSubjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Retrieval Augmented Generation (RAG) is widely employed to ground responses to queries on domain-specific documents. But do RAG implementations leave out important information or excessively include irrelevant information? To allay these concerns, it is necessary to annotate domain-specific benchmarks to evaluate information retrieval (IR) performance, as relevance definitions vary across queries and domains. Furthermore, such benchmarks should be cost-efficiently annotated to avoid annotation selection bias. In this paper, we propose DIRAS (Domain-specific Information Retrieval Annotation with Scalability), a manual-annotation-free schema that fine-tunes open-sourced LLMs to annotate relevance labels with calibrated relevance probabilities. Extensive evaluation shows that DIRAS fine-tuned models achieve GPT-4-level performance on annotating and ranking unseen (query, document) pairs, and is helpful for real-world RAG development.
- [597] arXiv:2406.14163 [pdf, html, other]
-
Title: A Unified Statistical And Computational Framework For Ex-Post Harmonisation Of Aggregate StatisticsSubjects: Databases (cs.DB); Methodology (stat.ME)
Ex-post harmonisation is one of many data preprocessing processes used to combine the increasingly vast and diverse sources of data available for research and analysis. Documenting provenance and ensuring the quality of multi-source datasets is vital for ensuring trustworthy scientific research and encouraging reuse of existing harmonisation efforts. However, capturing and communicating statistically relevant properties of harmonised datasets is difficult without a universal standard for describing harmonisation operations. Our paper combines mathematical and computer science perspectives to address this need. The Crossmaps Framework defines a new approach for transforming existing variables collected under a specific measurement or classification standard to an imputed counterfactual variable indexed by some target standard. It uses computational graphs to separate intended transformation logic from actual data transformations, and avoid the risk of syntactically valid data manipulation scripts resulting in statistically questionable data. In this paper, we introduce the Crossmaps Framework through the example of ex-post harmonisation of aggregated statistics in the social sciences. We define a new provenance task abstraction, the crossmap transform, and formalise two associated objects, the shared mass array and the crossmap. We further define graph, matrix and list encodings of crossmaps and discuss resulting implications for understanding statistical properties of ex-post harmonisation and designing error minimising workflows.
- [598] arXiv:2406.14164 [pdf, html, other]
-
Title: A Data-Driven Guided Decoding Mechanism for Diagnostic CaptioningPanagiotis Kaliosis, John Pavlopoulos, Foivos Charalampakos, Georgios Moschovis, Ion AndroutsopoulosComments: [Pre-print] ACL Findings 2024, 17 pages, 7 figures, 7 tablesSubjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Diagnostic Captioning (DC) automatically generates a diagnostic text from one or more medical images (e.g., X-rays, MRIs) of a patient. Treated as a draft, the generated text may assist clinicians, by providing an initial estimation of the patient's condition, speeding up and helping safeguard the diagnostic process. The accuracy of a diagnostic text, however, strongly depends on how well the key medical conditions depicted in the images are expressed. We propose a new data-driven guided decoding method that incorporates medical information, in the form of existing tags capturing key conditions of the image(s), into the beam search of the diagnostic text generation process. We evaluate the proposed method on two medical datasets using four DC systems that range from generic image-to-text systems with CNN encoders and RNN decoders to pre-trained Large Language Models. The latter can also be used in few- and zero-shot learning scenarios. In most cases, the proposed mechanism improves performance with respect to all evaluation measures. We provide an open-source implementation of the proposed method at this https URL.
- [599] arXiv:2406.14165 [pdf, html, other]
-
Title: Mechanism design augmented with output adviceSubjects: Computer Science and Game Theory (cs.GT)
Our work revisits the design of mechanisms via the learning-augmented framework. In this model, the algorithm is enhanced with imperfect (machine-learned) information concerning the input, usually referred to as prediction. The goal is to design algorithms whose performance degrades gently as a function of the prediction error and, in particular, perform well if the prediction is accurate, but also provide a worst-case guarantee under any possible error. This framework has been successfully applied recently to various mechanism design settings, where in most cases the mechanism is provided with a prediction about the types of the players.
We adopt a perspective in which the mechanism is provided with an output recommendation. We make no assumptions about the quality of the suggested outcome, and the goal is to use the recommendation to design mechanisms with low approximation guarantees whenever the recommended outcome is reasonable, but at the same time to provide worst-case guarantees whenever the recommendation significantly deviates from the optimal one. We propose a generic, universal measure, which we call quality of recommendation, to evaluate mechanisms across various information settings. We demonstrate how this new metric can provide refined analysis in existing results.
This model introduces new challenges, as the mechanism receives limited information comparing to settings that use predictions about the types of the agents. We study, through this lens, several well-studied mechanism design paradigms, devising new mechanisms, but also providing refined analysis for existing ones, using as a metric the quality of recommendation. We complement our positive results, by exploring the limitations of known classes of strategyproof mechanisms that can be devised using output recommendation. - [600] arXiv:2406.14167 [pdf, html, other]
-
Title: Definition generation for lexical semantic change detectionComments: Findings of ACL 2024Subjects: Computation and Language (cs.CL)
We use contextualized word definitions generated by large language models as semantic representations in the task of diachronic lexical semantic change detection (LSCD). In short, generated definitions are used as `senses', and the change score of a target word is retrieved by comparing their distributions in two time periods under comparison. On the material of five datasets and three languages, we show that generated definitions are indeed specific and general enough to convey a signal sufficient to rank sets of words by the degree of their semantic change over time. Our approach is on par with or outperforms prior non-supervised sense-based LSCD methods. At the same time, it preserves interpretability and allows to inspect the reasons behind a specific shift in terms of discrete definitions-as-senses. This is another step in the direction of explainable semantic change modeling.
- [601] arXiv:2406.14168 [pdf, html, other]
-
Title: An Asymptotic Preserving and Energy Stable Scheme for the Euler System with Congestion ConstraintSubjects: Numerical Analysis (math.NA)
In this work, we design and analyze an asymptotic preserving (AP), semi-implicit finite volume scheme for the scaled compressible isentropic Euler system with a singular pressure law known as the congestion pressure law. The congestion pressure law imposes a maximal density constraint of the form $0\leq \varrho <1$, and the scaling introduces a small parameter $\varepsilon$ in order to control the stiffness of the density constraint. As $\varepsilon\to 0$, the solutions of the compressible system converge to solutions of the so-called free-congested Euler equations that couples compressible and incompressible dynamics. We show that the proposed scheme is positivity preserving and energy stable. In addition, we also show that the numerical densities satisfy a discrete variant of the constraint. By means of extensive numerical case studies, we verify the efficacy of the scheme and show that the scheme is able to capture the two dynamics in the limiting regime, thereby proving the AP property.
- [602] arXiv:2406.14169 [pdf, html, other]
-
Title: Optimizing Novelty of Top-k Recommendations using Large Language Models and Reinforcement LearningComments: Accepted at KDD 2024Subjects: Information Retrieval (cs.IR); Machine Learning (cs.LG)
Given an input query, a recommendation model is trained using user feedback data (e.g., click data) to output a ranked list of items. In real-world systems, besides accuracy, an important consideration for a new model is novelty of its top-k recommendations w.r.t. an existing deployed model. However, novelty of top-k items is a difficult goal to optimize a model for, since it involves a non-differentiable sorting operation on the model's predictions. Moreover, novel items, by definition, do not have any user feedback data. Given the semantic capabilities of large language models, we address these problems using a reinforcement learning (RL) formulation where large language models provide feedback for the novel items. However, given millions of candidate items, the sample complexity of a standard RL algorithm can be prohibitively high. To reduce sample complexity, we reduce the top-k list reward to a set of item-wise rewards and reformulate the state space to consist of <query, item> tuples such that the action space is reduced to a binary decision; and show that this reformulation results in a significantly lower complexity when the number of items is large. We evaluate the proposed algorithm on improving novelty for a query-ad recommendation task on a large-scale search engine. Compared to supervised finetuning on recent <query, ad> pairs, the proposed RL-based algorithm leads to significant novelty gains with minimal loss in recall. We obtain similar results on the ORCAS query-webpage matching dataset and a product recommendation dataset based on Amazon reviews.
- [603] arXiv:2406.14171 [pdf, html, other]
-
Title: Ranking LLMs by compressionComments: 7 pages, 4 tablesSubjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
We conceptualize the process of understanding as information compression, and propose a method for ranking large language models (LLMs) based on lossless data compression. We demonstrate the equivalence of compression length under arithmetic coding with cumulative negative log probabilities when using a large language model as a prior, that is, the pre-training phase of the model is essentially the process of learning the optimal coding length. At the same time, the evaluation metric compression ratio can be obtained without actual compression, which greatly saves overhead. In this paper, we use five large language models as priors for compression, then compare their performance on challenging natural language processing tasks, including sentence completion, question answering, and coreference resolution. Experimental results show that compression ratio and model performance are positively correlated, so it can be used as a general metric to evaluate large language models.
- [604] arXiv:2406.14176 [pdf, html, other]
-
Title: A Multi-Stream Fusion Approach with One-Class Learning for Audio-Visual Deepfake DetectionSubjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
This paper addresses the challenge of developing a robust audio-visual deepfake detection model. In practical use cases, new generation algorithms are continually emerging, and these algorithms are not encountered during the development of detection methods. This calls for the generalization ability of the method. Additionally, to ensure the credibility of detection methods, it is beneficial for the model to interpret which cues from the video indicate it is fake. Motivated by these considerations, we then propose a multi-stream fusion approach with one-class learning as a representation-level regularization technique. We study the generalization problem of audio-visual deepfake detection by creating a new benchmark by extending and re-splitting the existing FakeAVCeleb dataset. The benchmark contains four categories of fake video(Real Audio-Fake Visual, Fake Audio-Fake Visual, Fake Audio-Real Visual, and unsynchronized video). The experimental results show that our approach improves the model's detection of unseen attacks by an average of 7.31% across four test sets, compared to the baseline model. Additionally, our proposed framework offers interpretability, indicating which modality the model identifies as fake.
- [605] arXiv:2406.14177 [pdf, html, other]
-
Title: SimulSeamless: FBK at IWSLT 2024 Simultaneous Speech TranslationSubjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)
This paper describes the FBK's participation in the Simultaneous Translation Evaluation Campaign at IWSLT 2024. For this year's submission in the speech-to-text translation (ST) sub-track, we propose SimulSeamless, which is realized by combining AlignAtt and SeamlessM4T in its medium configuration. The SeamlessM4T model is used "off-the-shelf" and its simultaneous inference is enabled through the adoption of AlignAtt, a SimulST policy based on cross-attention that can be applied without any retraining or adaptation of the underlying model for the simultaneous task. We participated in all the Shared Task languages (English->{German, Japanese, Chinese}, and Czech->English), achieving acceptable or even better results compared to last year's submissions. SimulSeamless, covering more than 143 source languages and 200 target languages, is released at: this https URL.
- [606] arXiv:2406.14178 [pdf, html, other]
-
Title: EvSegSNN: Neuromorphic Semantic Segmentation for Event DataSubjects: Computer Vision and Pattern Recognition (cs.CV)
Semantic segmentation is an important computer vision task, particularly for scene understanding and navigation of autonomous vehicles and UAVs. Several variations of deep neural network architectures have been designed to tackle this task. However, due to their huge computational costs and their high memory consumption, these models are not meant to be deployed on resource-constrained systems. To address this limitation, we introduce an end-to-end biologically inspired semantic segmentation approach by combining Spiking Neural Networks (SNNs, a low-power alternative to classical neural networks) with event cameras whose output data can directly feed these neural network inputs. We have designed EvSegSNN, a biologically plausible encoder-decoder U-shaped architecture relying on Parametric Leaky Integrate and Fire neurons in an objective to trade-off resource usage against performance. The experiments conducted on DDD17 demonstrate that EvSegSNN outperforms the closest state-of-the-art model in terms of MIoU while reducing the number of parameters by a factor of $1.6$ and sparing a batch normalization stage.
- [607] arXiv:2406.14180 [pdf, html, other]
-
Title: RTFormer: Re-parameter TSBN Spiking TransformerSubjects: Neural and Evolutionary Computing (cs.NE)
The Spiking Neural Networks (SNNs), renowned for their bio-inspired operational mechanism and energy efficiency, mirror the human brain's neural activity. Yet, SNNs face challenges in balancing energy efficiency with the computational demands of advanced tasks. Our research introduces the RTFormer, a novel architecture that embeds Re-parameterized Temporal Sliding Batch Normalization (TSBN) within the Spiking Transformer framework. This innovation optimizes energy usage during inference while ensuring robust computational performance. The crux of RTFormer lies in its integration of reparameterized convolutions and TSBN, achieving an equilibrium between computational prowess and energy conservation.
- [608] arXiv:2406.14183 [pdf, html, other]
-
Title: Latent. Functional MapSubjects: Machine Learning (cs.LG)
Neural models learn data representations that lie on low-dimensional manifolds, yet modeling the relation between these representational spaces is an ongoing challenge. By integrating spectral geometry principles into neural modeling, we show that this problem can be better addressed in the functional domain, mitigating complexity, while enhancing interpretability and performances on downstream tasks. To this end, we introduce a multi-purpose framework to the representation learning community, which allows to: (i) compare different spaces in an interpretable way and measure their intrinsic similarity; (ii) find correspondences between them, both in unsupervised and weakly supervised settings, and (iii) to effectively transfer representations between distinct spaces. We validate our framework on various applications, ranging from stitching to retrieval tasks, demonstrating that latent functional maps can serve as a swiss-army knife for representation alignment.
- [609] arXiv:2406.14185 [pdf, html, other]
-
Title: Failure-Resilient Distributed Inference with Model Compression over Heterogeneous Edge DevicesSubjects: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
The distributed inference paradigm enables the computation workload to be distributed across multiple devices, facilitating the implementations of deep learning based intelligent services on extremely resource-constrained Internet of Things (IoT) scenarios. Yet it raises great challenges to perform complicated inference tasks relying on a cluster of IoT devices that are heterogeneous in their computing/communication capacity and prone to crash or timeout failures. In this paper, we present RoCoIn, a robust cooperative inference mechanism for locally distributed execution of deep neural network-based inference tasks over heterogeneous edge devices. It creates a set of independent and compact student models that are learned from a large model using knowledge distillation for distributed deployment. In particular, the devices are strategically grouped to redundantly deploy and execute the same student model such that the inference process is resilient to any local failures, while a joint knowledge partition and student model assignment scheme are designed to minimize the response latency of the distributed inference system in the presence of devices with diverse capacities. Extensive simulations are conducted to corroborate the superior performance of our RoCoIn for distributed inference compared to several baselines, and the results demonstrate its efficacy in timely inference and failure resiliency.
- [610] arXiv:2406.14189 [pdf, html, other]
-
Title: In Tree Structure Should Sentence Be GeneratedSubjects: Computation and Language (cs.CL)
Generative models reliant on sequential autoregression have been at the forefront of language generation for an extensive period, particularly following the introduction of widely acclaimed transformers. Despite its excellent performance, there are always some issues that we face today. For example, problems such as hallucinations and getting trapped in a logic loop may occur. To enhance the performance of existing systems, this paper introduces a new method for generating sequences in natural language, which involves generating the targeted sentence in a tree-traversing order. The paper includes an illustration of the theoretical basis and validity of the approach, as well as a comparison of its fundamentals with the diffusion model in graphic generation. Finally, a module called SenTree is introduced for generating an approximating binary tree. It is already available at this https URL. Additionally, a joint training framework based on this approach is proposed, incorporating the intrinsics of generative adversarial networks.
- [611] arXiv:2406.14190 [pdf, html, other]
-
Title: Extended Resolution Clause Learning via Dual Implication PointsSubjects: Logic in Computer Science (cs.LO)
We present a new extended resolution clause learning (ERCL) algorithm, implemented as part of a conflict-driven clause-learning (CDCL) SAT solver, wherein new variables are dynamically introduced as definitions for {\it Dual Implication Points} (DIPs) in the implication graph constructed by the solver at runtime. DIPs are generalizations of unique implication points and can be informally viewed as a pair of dominator nodes, from the decision variable at the highest decision level to the conflict node, in an implication graph. We perform extensive experimental evaluation to establish the efficacy of our ERCL method, implemented as part of the MapleLCM SAT solver and dubbed xMapleLCM, against several leading solvers including the baseline MapleLCM, as well as CDCL solvers such as Kissat 3.1.1, CryptoMiniSat 5.11, and SBVA+CaDiCaL, the winner of SAT Competition 2023. We show that xMapleLCM outperforms these solvers on Tseitin and XORified formulas. We further compare xMapleLCM with GlucoseER, a system that implements extended resolution in a different way, and provide a detailed comparative analysis of their performance.
- [612] arXiv:2406.14191 [pdf, html, other]
-
Title: Temporal Knowledge Graph Question Answering: A SurveyComments: 8 pages, 3 figuresSubjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Knowledge Base Question Answering (KBQA) has been a long-standing field to answer questions based on knowledge bases. Recently, the evolving dynamics of knowledge have attracted a growing interest in Temporal Knowledge Graph Question Answering (TKGQA), an emerging task to answer temporal questions. However, this field grapples with ambiguities in defining temporal questions and lacks a systematic categorization of existing methods for TKGQA. In response, this paper provides a thorough survey from two perspectives: the taxonomy of temporal questions and the methodological categorization for TKGQA. Specifically, we first establish a detailed taxonomy of temporal questions engaged in prior studies. Subsequently, we provide a comprehensive review of TKGQA techniques of two categories: semantic parsing-based and TKG embedding-based. Building on this review, the paper outlines potential research directions aimed at advancing the field of TKGQA. This work aims to serve as a comprehensive reference for TKGQA and to stimulate further research.
- [613] arXiv:2406.14192 [pdf, html, other]
-
Title: Timo: Towards Better Temporal Reasoning for Language ModelsComments: Under reviewSubjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Reasoning about time is essential for Large Language Models (LLMs) to understand the world. Previous works focus on solving specific tasks, primarily on time-sensitive question answering. While these methods have proven effective, they cannot generalize to a wider spectrum of temporal reasoning tasks. Therefore, we propose a crucial question: Can we build a universal framework to handle a variety of temporal reasoning tasks? To that end, we systematically study 38 temporal reasoning tasks. Based on the observation that 19 tasks are directly related to mathematics, we first leverage the available mathematical dataset to set a solid foundation for temporal reasoning. However, the in-depth study indicates that focusing solely on mathematical enhancement falls short of addressing pure temporal reasoning tasks. To mitigate this limitation, we propose a simple but effective self-critic temporal optimization method to enhance the model's temporal reasoning capabilities without sacrificing general task abilities. Finally, we develop Timo, a model designed to excel in temporal reasoning at the 7B and 13B scales. Notably, Timo outperforms the counterpart LLMs by 10.0 and 7.6 in average accuracy scores and achieves the new state-of-the-art (SOTA) performance of comparable size. Extensive experiments further validate our framework's effectiveness and its generalization across diverse temporal tasks. The code is available at this https URL.
- [614] arXiv:2406.14194 [pdf, html, other]
-
Title: VLBiasBench: A Comprehensive Benchmark for Evaluating Bias in Large Vision-Language ModelSubjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
The emergence of Large Vision-Language Models (LVLMs) marks significant strides towards achieving general artificial intelligence. However, these advancements are tempered by the outputs that often reflect biases, a concern not yet extensively investigated. Existing benchmarks are not sufficiently comprehensive in evaluating biases due to their limited data scale, single questioning format and narrow sources of bias. To address this problem, we introduce VLBiasBench, a benchmark aimed at evaluating biases in LVLMs comprehensively. In VLBiasBench, we construct a dataset encompassing nine distinct categories of social biases, including age, disability status, gender, nationality, physical appearance, race, religion, profession, social economic status and two intersectional bias categories (race x gender, and race x social economic status). To create a large-scale dataset, we use Stable Diffusion XL model to generate 46,848 high-quality images, which are combined with different questions to form 128,342 samples. These questions are categorized into open and close ended types, fully considering the sources of bias and comprehensively evaluating the biases of LVLM from multiple perspectives. We subsequently conduct extensive evaluations on 15 open-source models as well as one advanced closed-source model, providing some new insights into the biases revealing from these models. Our benchmark is available at this https URL.
- [615] arXiv:2406.14197 [pdf, other]
-
Title: On the Representational Capacity of Neural Language Models with Chain-of-Thought ReasoningComments: To be published at ACL 2024Subjects: Computation and Language (cs.CL); Formal Languages and Automata Theory (cs.FL)
The performance of modern language models (LMs) has been improved by chain-of-thought (CoT) reasoning, i.e., the process of generating intermediate results that guide the model towards a final answer. A possible explanation for this improvement is that CoT reasoning extends an LM's computational power, as RNNs and transformers with additional scratch space are known to be Turing complete. Comparing LMs to Turing machines, however, introduces a category error - Turing machines decide language membership, whereas LMs define distributions over strings. To bridge this gap, we formalize CoT reasoning in a probabilistic setting. We present several results on the representational capacity of recurrent and transformer LMs with CoT reasoning, showing that they can represent the same family of distributions over strings as probabilistic Turing machines.
- [616] arXiv:2406.14201 [pdf, html, other]
-
Title: Trusting Semantic Segmentation NetworksSubjects: Computer Vision and Pattern Recognition (cs.CV)
Semantic segmentation has become an important task in computer vision with the growth of self-driving cars, medical image segmentation, etc. Although current models provide excellent results, they are still far from perfect and while there has been significant work in trying to improve the performance, both with respect to accuracy and speed of segmentation, there has been little work which analyses the failure cases of such systems. In this work, we aim to provide an analysis of how segmentation fails across different models and consider the question of whether these can be predicted reasonably at test time. To do so, we explore existing uncertainty-based metrics and see how well they correlate with misclassifications, allowing us to define the degree of trust we put in the output of our prediction models. Through several experiments on three different models across three datasets, we show that simple measures such as entropy can be used to capture misclassification with high recall rates.
- [617] arXiv:2406.14206 [pdf, html, other]
-
Title: Live Video CaptioningEduardo Blanco-Fernández, Carlos Gutiérrez-Álvarez, Nadia Nasri, Saturnino Maldonado-Bascón, Roberto J. López-SastreSubjects: Computer Vision and Pattern Recognition (cs.CV)
Dense video captioning is the task that involves the detection and description of events within video sequences. While traditional approaches focus on offline solutions where the entire video of analysis is available for the captioning model, in this work we introduce a paradigm shift towards Live Video Captioning (LVC). In LVC, dense video captioning models must generate captions for video streams in an online manner, facing important constraints such as having to work with partial observations of the video, the need for temporal anticipation and, of course, ensuring ideally a real-time response. In this work we formally introduce the novel problem of LVC and propose new evaluation metrics tailored for the online scenario, demonstrating their superiority over traditional metrics. We also propose an LVC model integrating deformable transformers and temporal filtering to address the LVC new challenges. Experimental evaluations on the ActivityNet Captions dataset validate the effectiveness of our approach, highlighting its performance in LVC compared to state-of-the-art offline methods. Results of our model as well as an evaluation kit with the novel metrics integrated are made publicly available to encourage further research on LVC.
- [618] arXiv:2406.14207 [pdf, html, other]
-
Title: LayerMatch: Do Pseudo-labels Benefit All Layers?Subjects: Machine Learning (cs.LG)
Deep neural networks have achieved remarkable performance across various tasks when supplied with large-scale labeled data. However, the collection of labeled data can be time-consuming and labor-intensive. Semi-supervised learning (SSL), particularly through pseudo-labeling algorithms that iteratively assign pseudo-labels for self-training, offers a promising solution to mitigate the dependency of labeled data. Previous research generally applies a uniform pseudo-labeling strategy across all model layers, assuming that pseudo-labels exert uniform influence throughout. Contrasting this, our theoretical analysis and empirical experiment demonstrate feature extraction layer and linear classification layer have distinct learning behaviors in response to pseudo-labels. Based on these insights, we develop two layer-specific pseudo-label strategies, termed Grad-ReLU and Avg-Clustering. Grad-ReLU mitigates the impact of noisy pseudo-labels by removing the gradient detrimental effects of pseudo-labels in the linear classification layer. Avg-Clustering accelerates the convergence of feature extraction layer towards stable clustering centers by integrating consistent outputs. Our approach, LayerMatch, which integrates these two strategies, can avoid the severe interference of noisy pseudo-labels in the linear classification layer while accelerating the clustering capability of the feature extraction layer. Through extensive experimentation, our approach consistently demonstrates exceptional performance on standard semi-supervised learning benchmarks, achieving a significant improvement of 10.38% over baseline method and a 2.44% increase compared to state-of-the-art methods.
- [619] arXiv:2406.14208 [pdf, html, other]
-
Title: SeCoKD: Aligning Large Language Models for In-Context Learning with Fewer ShotsComments: preprintSubjects: Artificial Intelligence (cs.AI)
Previous studies have shown that demonstrations can significantly help Large Language Models (LLMs ) perform better on the given tasks. However, this so-called In-Context Learning ( ICL ) ability is very sensitive to the presenting context, and often dozens of demonstrations are needed. In this work, we investigate if we can reduce the shot number while still maintaining a competitive performance. We present SeCoKD, a self-Knowledge Distillation ( KD ) training framework that aligns the student model with a heavily prompted variation, thereby increasing the utilization of a single demonstration. We experiment with the SeCoKD across three LLMs and six benchmarks focusing mainly on reasoning tasks. Results show that our method outperforms the base model and Supervised Fine-tuning ( SFT ), especially in zero-shot and one-shot settings by 30% and 10%, respectively. Moreover, SeCoKD brings little negative artifacts when evaluated on new tasks, which is more robust than Supervised Fine-tuning.
- [620] arXiv:2406.14212 [pdf, html, other]
-
Title: Discrete Virtual Rotation in Pointing vs. Leaning-Directed Steering Interfaces: A Uni vs. Bimanual PerspectiveSubjects: Human-Computer Interaction (cs.HC)
In this work, we explore the integration of discontinuous Orientation Selection into steering interfaces intending to preserve the seamless sensation of real-world movement, while mitigating the risk of inducing cybersickness. Our implementation encounters conflicts in standard input mappings, prompting us to adopt bimanual interaction as a solution. Recognizing the complexity that may arise from this step, we also develop unimanual alternatives, e.g., utilizing a Human-Joystick, commonly referred to as Leaning interface. The outcomes of an empirical study centered around a primed search task yield unexpected findings. We observed a sample of users spanning multiple levels of gaming experience and a balanced gender distribution exhibit no significant difficulties with the bimanual, asymmetric interfaces. Remarkably, the performance of Orientation Selection is, as in prior work, at least on par with Snap Rotation. Moreover, through a subsequent exploratory analysis, we uncover indications that Pointing-Directed Steering outperforms embodied interfaces in usability and task load in the given setting.
- [621] arXiv:2406.14213 [pdf, html, other]
-
Title: Complexity of Symbolic Representation in Working Memory of Transformer Correlates with the Complexity of a TaskComments: 18 pages, 6 figures. Published in the journal Cognitive Systems Research 3 June 2022: this https URLJournal-ref: Cognitive Systems Research, Volume 75, 2022, Pages 16-24, ISSN 1389-0417Subjects: Computation and Language (cs.CL)
Even though Transformers are extensively used for Natural Language Processing tasks, especially for machine translation, they lack an explicit memory to store key concepts of processed texts. This paper explores the properties of the content of symbolic working memory added to the Transformer model decoder. Such working memory enhances the quality of model predictions in machine translation task and works as a neural-symbolic representation of information that is important for the model to make correct translations. The study of memory content revealed that translated text keywords are stored in the working memory, pointing to the relevance of memory content to the processed text. Also, the diversity of tokens and parts of speech stored in memory correlates with the complexity of the corpora for machine translation task.
- [622] arXiv:2406.14214 [pdf, html, other]
-
Title: REVEAL-IT: REinforcement learning with Visibility of Evolving Agent poLicy for InTerpretabilitySubjects: Artificial Intelligence (cs.AI)
Understanding the agent's learning process, particularly the factors that contribute to its success or failure post-training, is crucial for comprehending the rationale behind the agent's decision-making process. Prior methods clarify the learning process by creating a structural causal model (SCM) or visually representing the distribution of value functions. Nevertheless, these approaches have constraints as they exclusively function in 2D-environments or with uncomplicated transition dynamics. Understanding the agent's learning process in complicated environments or tasks is more challenging. In this paper, we propose REVEAL-IT, a novel framework for explaining the learning process of an agent in complex environments. Initially, we visualize the policy structure and the agent's learning process for various training tasks. By visualizing these findings, we can understand how much a particular training task or stage affects the agent's performance in test. Then, a GNN-based explainer learns to highlight the most important section of the policy, providing a more clear and robust explanation of the agent's learning process. The experiments demonstrate that explanations derived from this framework can effectively help in the optimization of the
- [623] arXiv:2406.14217 [pdf, html, other]
-
Title: Defending Against Sophisticated Poisoning Attacks with RL-based Aggregation in Federated LearningSubjects: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
Federated learning is highly susceptible to model poisoning attacks, especially those meticulously crafted for servers. Traditional defense methods mainly focus on updating assessments or robust aggregation against manually crafted myopic attacks. When facing advanced attacks, their defense stability is notably insufficient. Therefore, it is imperative to develop adaptive defenses against such advanced poisoning attacks. We find that benign clients exhibit significantly higher data distribution stability than malicious clients in federated learning in both CV and NLP tasks. Therefore, the malicious clients can be recognized by observing the stability of their data distribution. In this paper, we propose AdaAggRL, an RL-based Adaptive Aggregation method, to defend against sophisticated poisoning attacks. Specifically, we first utilize distribution learning to simulate the clients' data distributions. Then, we use the maximum mean discrepancy (MMD) to calculate the pairwise similarity of the current local model data distribution, its historical data distribution, and global model data distribution. Finally, we use policy learning to adaptively determine the aggregation weights based on the above similarities. Experiments on four real-world datasets demonstrate that the proposed defense model significantly outperforms widely adopted defense models for sophisticated attacks.
- [624] arXiv:2406.14219 [pdf, html, other]
-
Title: Proving Olympiad Algebraic Inequalities without Human DemonstrationsComments: 36 pages, 32 figures, 2 tablesSubjects: Artificial Intelligence (cs.AI)
Solving Olympiad-level mathematical problems represents a significant advancement in machine intelligence and automated reasoning. Current machine learning methods, however, struggle to solve Olympiad-level problems beyond Euclidean plane geometry due to a lack of large-scale, high-quality datasets. The challenge is even greater in algebraic systems, which involve infinite reasoning spaces within finite conditions. To address these issues, we propose AIPS, an Algebraic Inequality Proving System capable of autonomously generating complex inequality theorems and effectively solving Olympiad-level inequality problems without requiring human demonstrations. During proof search in a mixed reasoning manner, a value curriculum learning strategy on generated datasets is implemented to improve proving performance, demonstrating strong mathematical intuitions. On a test set of 20 International Mathematical Olympiad-level inequality problems, AIPS successfully solved 10, outperforming state-of-the-art methods. Furthermore, AIPS automatically generated a vast array of non-trivial theorems without human intervention, some of which have been evaluated by professional contestants and deemed to reach the level of the International Mathematical Olympiad. Notably, one theorem was selected as a competition problem in a major city 2024 Mathematical Olympiad.
- [625] arXiv:2406.14220 [pdf, other]
-
Title: Evaluation of Deep Learning Semantic Segmentation for Land Cover Mapping on Multispectral, Hyperspectral and High Spatial Aerial ImagerySubjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
In the rise of climate change, land cover mapping has become such an urgent need in environmental monitoring. The accuracy of land cover classification has gotten increasingly based on the improvement of remote sensing data. Land cover classification using satellite imageries has been explored and become more prevalent in recent years, but the methodologies remain some drawbacks of subjective and time-consuming. Some deep learning techniques have been utilized to overcome these limitations. However, most studies implemented just one image type to evaluate algorithms for land cover mapping. Therefore, our study conducted deep learning semantic segmentation in multispectral, hyperspectral, and high spatial aerial image datasets for landcover mapping. This research implemented a semantic segmentation method such as Unet, Linknet, FPN, and PSPnet for categorizing vegetation, water, and others (i.e., soil and impervious surface). The LinkNet model obtained high accuracy in IoU (Intersection Over Union) at 0.92 in all datasets, which is comparable with other mentioned techniques. In evaluation with different image types, the multispectral images showed higher performance with the IoU, and F1-score are 0.993 and 0.997, respectively. Our outcome highlighted the efficiency and broad applicability of LinkNet and multispectral image on land cover classification. This research contributes to establishing an approach on landcover segmentation via open source for long-term future application.
- [626] arXiv:2406.14226 [pdf, html, other]
-
Title: Uncertainty and Self-Supervision in Single-View DepthComments: Doctoral thesisSubjects: Computer Vision and Pattern Recognition (cs.CV)
Single-view depth estimation refers to the ability to derive three-dimensional information per pixel from a single two-dimensional image. Single-view depth estimation is an ill-posed problem because there are multiple depth solutions that explain 3D geometry from a single view. While deep neural networks have been shown to be effective at capturing depth from a single view, the majority of current methodologies are deterministic in nature. Accounting for uncertainty in the predictions can avoid disastrous consequences when applied to fields such as autonomous driving or medical robotics. We have addressed this problem by quantifying the uncertainty of supervised single-view depth for Bayesian deep neural networks. There are scenarios, especially in medicine in the case of endoscopic images, where such annotated data is not available. To alleviate the lack of data, we present a method that improves the transition from synthetic to real domain methods. We introduce an uncertainty-aware teacher-student architecture that is trained in a self-supervised manner, taking into account the teacher uncertainty. Given the vast amount of unannotated data and the challenges associated with capturing annotated depth in medical minimally invasive procedures, we advocate a fully self-supervised approach that only requires RGB images and the geometric and photometric calibration of the endoscope. In endoscopic imaging, the camera and light sources are co-located at a small distance from the target surfaces. This setup indicates that brighter areas of the image are nearer to the camera, while darker areas are further away. Building on this observation, we exploit the fact that for any given albedo and surface orientation, pixel brightness is inversely proportional to the square of the distance. We propose the use of illumination as a strong single-view self-supervisory signal for deep neural networks.
- [627] arXiv:2406.14227 [pdf, other]
-
Title: Modular Synthesis of Efficient Quantum UncomputationComments: 25 pages, 9 figuresSubjects: Programming Languages (cs.PL)
A key challenge of quantum programming is uncomputation: the reversible deallocation of qubits. And while there has been much recent progress on automating uncomputation, state-of-the-art methods are insufficient for handling today's expressive quantum programming languages. A core reason is that they operate on primitive quantum circuits, while quantum programs express computations beyond circuits, for instance, they can capture families of circuits defined recursively in terms of uncomputation and adjoints.
In this paper, we introduce the first modular automatic approach to synthesize correct and efficient uncomputation for expressive quantum programs. Our method is based on two core technical contributions: (i) an intermediate representation (IR) that can capture expressive quantum programs and comes with support for uncomputation, and (ii) modular algorithms over that IR for synthesizing uncomputation and adjoints.
We have built a complete end-to-end implementation of our method, including an implementation of the IR and the synthesis algorithms, as well as a translation from an expressive fragment of the Silq programming language to our IR and circuit generation from the IR. Our experimental evaluation demonstrates that we can handle programs beyond the capabilities of existing uncomputation approaches, while being competitive on the benchmarks they can handle. More broadly, we show that it is possible to benefit from the greater expressivity and safety offered by high-level quantum languages without sacrificing efficiency. - [628] arXiv:2406.14228 [pdf, other]
-
Title: EvoAgent: Towards Automatic Multi-Agent Generation via Evolutionary AlgorithmsComments: Work in processSubjects: Artificial Intelligence (cs.AI)
The rise of powerful large language models (LLMs) has spurred a new trend in building LLM-based autonomous agents for solving complex tasks, especially multi-agent systems. Despite the remarkable progress, we notice that existing works are heavily dependent on human-designed frameworks, which greatly limits the functional scope and scalability of agent systems. How to automatically extend the specialized agent to multi-agent systems to improve task-solving capability still remains a significant challenge. In this paper, we introduce EvoAgent, a generic method to automatically extend expert agents to multi-agent systems via the evolutionary algorithm, thereby improving the effectiveness of LLM-based agents in solving tasks. Specifically, we consider the existing agent frameworks as the initial individual and then apply a series of evolutionary operators (e.g., mutation, crossover, selection, etc.) to generate multiple agents with diverse agent settings. EvoAgent can be generalized to any LLM-based agent framework, and can automatically extend the existing agent framework to multi-agent systems without any extra human designs. Experimental results across various tasks have shown that EvoAgent can automatically generate multiple expert agents and significantly enhance the task-solving capabilities of LLM-based agents.
- [629] arXiv:2406.14230 [pdf, html, other]
-
Title: Raising the Bar: Investigating the Values of Large Language Models via Generative Evolving TestingComments: Work in progressSubjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Warning: this paper contains model outputs exhibiting unethical information. Large Language Models (LLMs) have achieved significant breakthroughs, but their generated unethical content poses potential risks. Measuring value alignment of LLMs becomes crucial for their regulation and responsible deployment. Numerous datasets have been constructed to assess social bias, toxicity, and ethics in LLMs, but they suffer from evaluation chronoeffect, that is, as models rapidly evolve, existing data becomes leaked or undemanding, overestimating ever-developing LLMs. To tackle this problem, we propose GETA, a novel generative evolving testing approach that dynamically probes the underlying moral baselines of LLMs. Distinct from previous adaptive testing methods that rely on static datasets with limited difficulty, GETA incorporates an iteratively-updated item generator which infers each LLM's moral boundaries and generates difficulty-tailored testing items, accurately reflecting the true alignment extent. This process theoretically learns a joint distribution of item and model response, with item difficulty and value conformity as latent variables, where the generator co-evolves with the LLM, addressing chronoeffect. We evaluate various popular LLMs with diverse capabilities and demonstrate that GETA can create difficulty-matching testing items and more accurately assess LLMs' values, better consistent with their performance on unseen OOD and i.i.d. items, laying the groundwork for future evaluation paradigms.
- [630] arXiv:2406.14231 [pdf, html, other]
-
Title: aeon: a Python toolkit for learning from time seriesMatthew Middlehurst, Ali Ismail-Fawaz, Antoine Guillaume, Christopher Holder, David Guijo Rubio, Guzal Bulatova, Leonidas Tsaprounis, Lukasz Mentel, Martin Walter, Patrick Schäfer, Anthony BagnallComments: 10 pagesSubjects: Machine Learning (cs.LG)
aeon is a unified Python 3 library for all machine learning tasks involving time series. The package contains modules for time series forecasting, classification, extrinsic regression and clustering, as well as a variety of utilities, transformations and distance measures designed for time series data. aeon also has a number of experimental modules for tasks such as anomaly detection, similarity search and segmentation. aeon follows the scikit-learn API as much as possible to help new users and enable easy integration of aeon estimators with useful tools such as model selection and pipelines. It provides a broad library of time series algorithms, including efficient implementations of the very latest advances in research. Using a system of optional dependencies, aeon integrates a wide variety of packages into a single interface while keeping the core framework with minimal dependencies. The package is distributed under the 3-Clause BSD license and is available at this https URL aeon-toolkit/aeon. This version was submitted to the JMLR journal on 02 Nov 2023 for v0.5.0 of aeon. At the time of this preprint aeon has released v0.9.0, and has had substantial changes.
- [631] arXiv:2406.14232 [pdf, html, other]
-
Title: Enhancing robustness of data-driven SHM models: adversarial training with circle lossComments: 12 pages, 9 figuresSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Structural health monitoring (SHM) is critical to safeguarding the safety and reliability of aerospace, civil, and mechanical infrastructure. Machine learning-based data-driven approaches have gained popularity in SHM due to advancements in sensors and computational power. However, machine learning models used in SHM are vulnerable to adversarial examples -- even small changes in input can lead to different model outputs. This paper aims to address this problem by discussing adversarial defenses in SHM. In this paper, we propose an adversarial training method for defense, which uses circle loss to optimize the distance between features in training to keep examples away from the decision boundary. Through this simple yet effective constraint, our method demonstrates substantial improvements in model robustness, surpassing existing defense mechanisms.
- [632] arXiv:2406.14233 [pdf, html, other]
-
Title: Region-Specific Coarse Quantization with Check Node Awareness in 5G-LDPC DecodingComments: This paper has been submitted to IEEE Transactions on CommunicationsSubjects: Information Theory (cs.IT)
This paper presents novel techniques for improving the error correction performance and reducing the complexity of coarsely quantized 5G-LDPC decoders. The proposed decoder design supports arbitrary message-passing schedules on a base-matrix level by modeling exchanged messages with entry-specific discrete random variables. Variable nodes (VNs) and check nodes (CNs) involve compression operations designed using the information bottleneck method to maximize preserved mutual information between code bits and quantized messages. We introduce alignment regions that assign the messages to groups with aligned reliability levels to decrease the number of individual design parameters. Group compositions with degree-specific separation of messages improve performance by up to 0.4 dB. Further, we generalize our recently proposed CN-aware quantizer design to irregular LDPC codes and layered schedules. The method optimizes the VN quantizer to maximize preserved mutual information at the output of the subsequent CN update, enhancing performance by up to 0.2 dB. A schedule optimization modifies the order of layer updates, reducing the average iteration count by up to 35 %. We integrate all new techniques in a rate-compatible decoder design by extending the alignment regions along a rate-dimension. Our complexity analysis for 2-bit decoding estimates up to 64 % higher throughput versus 4-bit decoding at similar performance.
- [633] arXiv:2406.14235 [pdf, html, other]
-
Title: Mitigating the Human-Robot Domain Discrepancy in Visual Pre-training for Robotic ManipulationSubjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Learning generalizable visual dynamic representation across different embodied environments is crucial for real-world robotic manipulation. As the scale and diversity of robot demonstration data are limited, recent works have turned to large-scale pre-training using human data. However, the morphological differences between humans and robots introduce a significant human-robot domain discrepancy, challenging the generalization of these human-data pre-trained models to downstream manipulation tasks. To address this, we propose a novel adaptation paradigm that utilizes readily available paired human-robot video data to bridge the discrepancy. Following this paradigm, our method exploits a human-robot contrastive alignment loss to align the semantics of human and robot videos, adapting pre-trained models to the robotic domain in a parameter-efficient manner. The experiments demonstrate significant improvements on 25 tasks across three different benchmarks, where the single-task, language-conditioned multi-task settings are covered, and two different pre-trained models are evaluated. On the large RLBench benchmark, our adaptation method achieves an average improvement of $8.9\%$ in success rate over the pre-trained R3M model across multiple tasks. We will release the code and models upon acceptance.
- [634] arXiv:2406.14237 [pdf, other]
-
Title: Finite Alphabet Fast List Decoders for Polar CodesComments: 6 pages, 7 figures, submitted to IEEE GLOBECOM 2024Subjects: Information Theory (cs.IT)
The so-called fast polar decoding schedules are meant to improve the decoding speed of the sequential-natured successive cancellation list decoders. The decoding speedup is achieved by replacing various parts of the serial decoding process with efficient special-purpose decoder nodes. This work incorporates the fast decoding schedules for polar codes into their quantized finite alphabet decoding. In a finite alphabet successive cancellation list decoder, the log-likelihood ratio computations are replaced with lookup operations on low-resolution integer messages. The lookup tables are designed using the information bottleneck method. It is shown that the finite alphabet decoders can also leverage the special decoder nodes found in the literature. Besides their inherent decoding speed improvement, the use of these special decoder nodes drastically reduces the number of lookup tables required to perform the finite alphabet decoding. In order to perform quantized decoding using lookup operations, the proposed decoders require up to 93% less unique lookup tables as compared to the ones that use the conventional successive cancellation schedule. Moreover, the proposed decoders exhibit negligible loss in error correction performance without necessitating alterations to the lookup table design process.
- [635] arXiv:2406.14239 [pdf, html, other]
-
Title: LeYOLO, New Scalable and Efficient CNN Architecture for Object DetectionSubjects: Computer Vision and Pattern Recognition (cs.CV)
Computational efficiency in deep neural networks is critical for object detection, especially as newer models prioritize speed over efficient computation (FLOP). This evolution has somewhat left behind embedded and mobile-oriented AI object detection applications. In this paper, we focus on design choices of neural network architectures for efficient object detection computation based on FLOP and propose several optimizations to enhance the efficiency of YOLO-based models.
Firstly, we introduce an efficient backbone scaling inspired by inverted bottlenecks and theoretical insights from the Information Bottleneck principle. Secondly, we present the Fast Pyramidal Architecture Network (FPAN), designed to facilitate fast multiscale feature sharing while reducing computational resources. Lastly, we propose a Decoupled Network-in-Network (DNiN) detection head engineered to deliver rapid yet lightweight computations for classification and regression tasks.
Building upon these optimizations and leveraging more efficient backbones, this paper contributes to a new scaling paradigm for object detection and YOLO-centric models called LeYOLO. Our contribution consistently outperforms existing models in various resource constraints, achieving unprecedented accuracy and flop ratio. Notably, LeYOLO-Small achieves a competitive mAP score of 38.2% on the COCOval with just 4.5 FLOP(G), representing a 42% reduction in computational load compared to the latest state-of-the-art YOLOv9-Tiny model while achieving similar accuracy. Our novel model family achieves a FLOP-to-accuracy ratio previously unattained, offering scalability that spans from ultra-low neural network configurations (< 1 GFLOP) to efficient yet demanding object detection setups (> 4 GFLOPs) with 25.2, 31.3, 35.2, 38.2, 39.3 and 41 mAP for 0.66, 1.47, 2.53, 4.51, 5.8 and 8.4 FLOP(G). - [636] arXiv:2406.14240 [pdf, other]
-
Title: CityNav: Language-Goal Aerial Navigation Dataset with Geographic InformationJungdae Lee, Taiki Miyanishi, Shuhei Kurita, Koya Sakamoto, Daichi Azuma, Yutaka Matsuo, Nakamasa InoueComments: The first two authors are equally contributedSubjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Vision-and-language navigation (VLN) aims to guide autonomous agents through real-world environments by integrating visual and linguistic cues. While substantial progress has been made in understanding these interactive modalities in ground-level navigation, aerial navigation remains largely underexplored. This is primarily due to the scarcity of resources suitable for real-world, city-scale aerial navigation studies. To bridge this gap, we introduce CityNav, a new dataset for language-goal aerial navigation using a 3D point cloud representation from real-world cities. CityNav includes 32,637 natural language descriptions paired with human demonstration trajectories, collected from participants via a new web-based 3D simulator developed for this research. Each description specifies a navigation goal, leveraging the names and locations of landmarks within real-world cities. We also provide baseline models of navigation agents that incorporate an internal 2D spatial map representing landmarks referenced in the descriptions. We benchmark the latest aerial navigation baselines and our proposed model on the CityNav dataset. The results using this dataset reveal the following key findings: (i) Our aerial agent models trained on human demonstration trajectories outperform those trained on shortest path trajectories, highlighting the importance of human-driven navigation strategies; (ii) The integration of a 2D spatial map significantly enhances navigation efficiency at city scale. Our dataset and code are available at this https URL
- [637] arXiv:2406.14243 [pdf, html, other]
-
Title: AuditMAI: Towards An Infrastructure for Continuous AI AuditingSubjects: Computers and Society (cs.CY)
Artificial Intelligence (AI) Auditability is a core requirement for achieving responsible AI system design. However, it is not yet a prominent design feature in current applications. Existing AI auditing tools typically lack integration features and remain as isolated approaches. This results in manual, high-effort, and mostly one-off AI audits, necessitating alternative methods. Inspired by other domains such as finance, continuous AI auditing is a promising direction to conduct regular assessments of AI systems. The issue remains, however, since the methods for continuous AI auditing are not mature yet at the moment. To address this gap, we propose the Auditability Method for AI (AuditMAI), which is intended as a blueprint for an infrastructure towards continuous AI auditing. For this purpose, we first clarified the definition of AI auditability based on literature. Secondly, we derived requirements from two industrial use cases for continuous AI auditing tool support. Finally, we developed AuditMAI and discussed its elements as a blueprint for a continuous AI auditability infrastructure.
- [638] arXiv:2406.14245 [pdf, html, other]
-
Title: On countering adversarial perturbations in graphs using error correcting codesSubjects: Cryptography and Security (cs.CR); Data Analysis, Statistics and Probability (physics.data-an)
We consider the problem of a graph subjected to adversarial perturbations, such as those arising from cyber-attacks, where edges are covertly added or removed. The adversarial perturbations occur during the transmission of the graph between a sender and a receiver. To counteract potential perturbations, we explore a repetition coding scheme with sender-assigned binary noise and majority voting on the receiver's end to rectify the graph's structure. Our approach operates without prior knowledge of the attack's characteristics. We provide an analytical derivation of a bound on the number of repetitions needed to satisfy probabilistic constraints on the quality of the reconstructed graph. We show that the method can accurately decode graphs that were subjected to non-random edge removal, namely, those connected to vertices with the highest eigenvector centrality, in addition to random addition and removal of edges by the attacker.
- [639] arXiv:2406.14250 [pdf, html, other]
-
Title: E-ANT: A Large-Scale Dataset for Efficient Automatic GUI NavigaTionComments: 9 pages, 5 figures, Under review at ARR (for EMNLP 2024)Subjects: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
Online GUI navigation on mobile devices has driven a lot of attention recent years since it contributes to many real-world applications. With the rapid development of large language models (LLM), multimodal large language models (MLLM) have tremendous potential on this task. However, existing MLLMs need high quality data to improve its abilities of making the correct navigation decisions according to the human user inputs. In this paper, we developed a novel and highly valuable dataset, named \textbf{E-ANT}, as the first Chinese GUI navigation dataset that contains real human behaviour and high quality screenshots with annotations, containing nearly 40,000 real human traces over 5000+ different tinyAPPs. Furthermore, we evaluate various powerful MLLMs on E-ANT and show their experiments results with sufficient ablations. We believe that our proposed dataset will be beneficial for both the evaluation and development of GUI navigation and LLM/MLLM decision-making capabilities.
- [640] arXiv:2406.14251 [pdf, html, other]
-
Title: Enhanced Optimal Power Flow Based Droop Control in MMC-MTDC SystemsSubjects: Systems and Control (eess.SY)
Optimizing operational set points for modular multilevel converters (MMCs) in Multi-Terminal Direct Current (MTDC) transmission systems is crucial for ensuring efficient power distribution and control. This paper presents an enhanced Optimal Power Flow (OPF) model for MMC-MTDC systems, integrating a novel adaptive voltage droop control strategy. The strategy aims to minimize generation costs and DC voltage deviations while ensuring the stable operation of the MTDC grid by dynamically adjusting the system operation points. The modified Nordic 32 test system with an embedded 4-terminal DC grid is modeled in Julia and the proposed control strategy is applied to the power model. The results demonstrate the feasibility and effectiveness of the proposed droop control strategy, affirming its potential value in enhancing the performance and reliability of hybrid AC-DC power systems.
- [641] arXiv:2406.14255 [pdf, html, other]
-
Title: DuMapNet: An End-to-End Vectorization System for City-Scale Lane-Level Map GenerationDeguo Xia, Weiming Zhang, Xiyan Liu, Wei Zhang, Chenting Gong, Jizhou Huang, Mengmeng Yang, Diange YangComments: Accepted by KDD 2024, camera-ready versionSubjects: Computer Vision and Pattern Recognition (cs.CV)
Generating city-scale lane-level maps faces significant challenges due to the intricate urban environments, such as blurred or absent lane markings. Additionally, a standard lane-level map requires a comprehensive organization of lane groupings, encompassing lane direction, style, boundary, and topology, yet has not been thoroughly examined in prior research. These obstacles result in labor-intensive human annotation and high maintenance costs. This paper overcomes these limitations and presents an industrial-grade solution named DuMapNet that outputs standardized, vectorized map elements and their topology in an end-to-end paradigm. To this end, we propose a group-wise lane prediction (GLP) system that outputs vectorized results of lane groups by meticulously tailoring a transformer-based network. Meanwhile, to enhance generalization in challenging scenarios, such as road wear and occlusions, as well as to improve global consistency, a contextual prompts encoder (CPE) module is proposed, which leverages the predicted results of spatial neighborhoods as contextual information. Extensive experiments conducted on large-scale real-world datasets demonstrate the superiority and effectiveness of DuMapNet. Additionally, DuMap-Net has already been deployed in production at Baidu Maps since June 2023, supporting lane-level map generation tasks for over 360 cities while bringing a 95% reduction in costs. This demonstrates that DuMapNet serves as a practical and cost-effective industrial solution for city-scale lane-level map generation.
- [642] arXiv:2406.14259 [pdf, html, other]
-
Title: MEAT: Median-Ensemble Adversarial Training for Improving Robustness and GeneralizationSubjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Self-ensemble adversarial training methods improve model robustness by ensembling models at different training epochs, such as model weight averaging (WA). However, previous research has shown that self-ensemble defense methods in adversarial training (AT) still suffer from robust overfitting, which severely affects the generalization performance. Empirically, in the late phases of training, the AT becomes more overfitting to the extent that the individuals for weight averaging also suffer from overfitting and produce anomalous weight values, which causes the self-ensemble model to continue to undergo robust overfitting due to the failure in removing the weight anomalies. To solve this problem, we aim to tackle the influence of outliers in the weight space in this work and propose an easy-to-operate and effective Median-Ensemble Adversarial Training (MEAT) method to solve the robust overfitting phenomenon existing in self-ensemble defense from the source by searching for the median of the historical model weights. Experimental results show that MEAT achieves the best robustness against the powerful AutoAttack and can effectively allievate the robust overfitting. We further demonstrate that most defense methods can improve robust generalization and robustness by combining with MEAT.
- [643] arXiv:2406.14261 [pdf, html, other]
-
Title: Unleashing the Potential of Tracklets for Unsupervised Video Person Re-IdentificationComments: The first two authors contributed equallySubjects: Computer Vision and Pattern Recognition (cs.CV)
With rich temporal-spatial information, video-based person re-identification methods have shown broad prospects. Although tracklets can be easily obtained with ready-made tracking models, annotating identities is still expensive and impractical. Therefore, some video-based methods propose using only a few identity annotations or camera labels to facilitate feature learning. They also simply average the frame features of each tracklet, overlooking unexpected variations and inherent identity consistency within tracklets. In this paper, we propose the Self-Supervised Refined Clustering (SSR-C) framework without relying on any annotation or auxiliary information to promote unsupervised video person re-identification. Specifically, we first propose the Noise-Filtered Tracklet Partition (NFTP) module to reduce the feature bias of tracklets caused by noisy tracking results, and sequentially partition the noise-filtered tracklets into "sub-tracklets". Then, we cluster and further merge sub-tracklets using the self-supervised signal from tracklet partition, which is enhanced through a progressive strategy to generate reliable pseudo labels, facilitating intra-class cross-tracklet aggregation. Moreover, we propose the Class Smoothing Classification (CSC) loss to efficiently promote model learning. Extensive experiments on the MARS and DukeMTMC-VideoReID datasets demonstrate that our proposed SSR-C for unsupervised video person re-identification achieves state-of-the-art results and is comparable to advanced supervised methods.
- [644] arXiv:2406.14263 [pdf, html, other]
-
Title: Scalable and RISC-V Programmable Near-Memory Computing Architectures for Edge NodesMichele Caon (1), Clément Choné (2), Pasquale Davide Schiavone (2), Alexandre Levisse (2), Guido Masera (1), Maurizio Martina (1), David Atienza (2) ((1) Politecnico di Torino, (2) École Polytechnique Fédérale de Lausanne (EPFL))Comments: 14 pages, 12 figures, submitted to IEEE Transactions on Emerging Topics in ComputingSubjects: Hardware Architecture (cs.AR)
The widespread adoption of data-centric algorithms, particularly Artificial Intelligence (AI) and Machine Learning (ML), has exposed the limitations of centralized processing infrastructures, driving a shift towards edge computing. This necessitates stringent constraints on energy efficiency, which traditional von Neumann architectures struggle to meet. The Compute-In-Memory (CIM) paradigm has emerged as a superior candidate due to its efficient exploitation of available memory bandwidth. However, existing CIM solutions require high implementation effort and lack flexibility from a software integration standpoint. This work proposes a novel, software-friendly, general-purpose, and low-integration-effort Near-Memory Computing (NMC) approach, paving the way for the adoption of CIM-based systems in the next generation of edge computing nodes. Two architectural variants, NM-Caesar and NM-Carus, are proposed and characterized to target different trade-offs in area efficiency, performance, and flexibility, covering a wide range of embedded microcontrollers. Post-layout simulations show up to $25.8\times$ and $50.0\times$ lower execution time and $23.2\times$ and $33.1\times$ higher energy efficiency at the system level, respectively, compared to executing the same tasks on a state-of-the-art RISC-V CPU (RV32IMC). NM-Carus achieves a peak energy efficiency of $306.7$ GOPS/W in 8-bit matrix multiplications, surpassing recent state-of-the-art in- and near-memory circuits.
- [645] arXiv:2406.14265 [pdf, html, other]
-
Title: VeriFlow: Modeling Distributions for Neural Network VerificationSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO); Symbolic Computation (cs.SC)
Formal verification has emerged as a promising method to ensure the safety and reliability of neural networks. Naively verifying a safety property amounts to ensuring the safety of a neural network for the whole input space irrespective of any training or test set. However, this also implies that the safety of the neural network is checked even for inputs that do not occur in the real-world and have no meaning at all, often resulting in spurious errors. To tackle this shortcoming, we propose the VeriFlow architecture as a flow based density model tailored to allow any verification approach to restrict its search to the some data distribution of interest. We argue that our architecture is particularly well suited for this purpose because of two major properties. First, we show that the transformation and log-density function that are defined by our model are piece-wise affine. Therefore, the model allows the usage of verifiers based on SMT with linear arithmetic. Second, upper density level sets (UDL) of the data distribution take the shape of an $L^p$-ball in the latent space. As a consequence, representations of UDLs specified by a given probability are effectively computable in latent space. This allows for SMT and abstract interpretation approaches with fine-grained, probabilistically interpretable, control regarding on how (a)typical the inputs subject to verification are.
- [646] arXiv:2406.14266 [pdf, html, other]
-
Title: Intelligent Interface: Enhancing Lecture Engagement with Didactic Activity SummariesAnna Wróblewska, Marcel Witas, Kinga Frańczak, Arkadiusz Kniaź, Siew Ann Cheong, Tan Seng Chee, Janusz Hołyst, Marcin PaprzyckiComments: 9 pages, 6 figuresSubjects: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Recently, multiple applications of machine learning have been introduced. They include various possibilities arising when image analysis methods are applied to, broadly understood, video streams. In this context, a novel tool, developed for academic educators to enhance the teaching process by automating, summarizing, and offering prompt feedback on conducting lectures, has been developed. The implemented prototype utilizes machine learning-based techniques to recognise selected didactic and behavioural teachers' features within lecture video recordings.
Specifically, users (teachers) can upload their lecture videos, which are preprocessed and analysed using machine learning models. Next, users can view summaries of recognized didactic features through interactive charts and tables. Additionally, stored ML-based prediction results support comparisons between lectures based on their didactic content. In the developed application text-based models trained on lecture transcriptions, with enhancements to the transcription quality, by adopting an automatic speech recognition solution are applied. Furthermore, the system offers flexibility for (future) integration of new/additional machine-learning models and software modules for image and video analysis. - [647] arXiv:2406.14267 [pdf, html, other]
-
Title: On the Evaluation Practices in Multilingual NLP: Can Machine Translation Offer an Alternative to Human Translations?Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
While multilingual language models (MLMs) have been trained on 100+ languages, they are typically only evaluated across a handful of them due to a lack of available test data in most languages. This is particularly problematic when assessing MLM's potential for low-resource and unseen languages. In this paper, we present an analysis of existing evaluation frameworks in multilingual NLP, discuss their limitations, and propose several directions for more robust and reliable evaluation practices. Furthermore, we empirically study to what extent machine translation offers a {reliable alternative to human translation} for large-scale evaluation of MLMs across a wide set of languages. We use a SOTA translation model to translate test data from 4 tasks to 198 languages and use them to evaluate three MLMs. We show that while the selected subsets of high-resource test languages are generally sufficiently representative of a wider range of high-resource languages, we tend to overestimate MLMs' ability on low-resource languages. Finally, we show that simpler baselines can achieve relatively strong performance without having benefited from large-scale multilingual pretraining.
- [648] arXiv:2406.14272 [pdf, html, other]
-
Title: MultiTalk: Enhancing 3D Talking Head Generation Across Languages with Multilingual Video DatasetComments: Interspeech 2024Subjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Recent studies in speech-driven 3D talking head generation have achieved convincing results in verbal articulations. However, generating accurate lip-syncs degrades when applied to input speech in other languages, possibly due to the lack of datasets covering a broad spectrum of facial movements across languages. In this work, we introduce a novel task to generate 3D talking heads from speeches of diverse languages. We collect a new multilingual 2D video dataset comprising over 420 hours of talking videos in 20 languages. With our proposed dataset, we present a multilingually enhanced model that incorporates language-specific style embeddings, enabling it to capture the unique mouth movements associated with each language. Additionally, we present a metric for assessing lip-sync accuracy in multilingual settings. We demonstrate that training a 3D talking head model with our proposed dataset significantly enhances its multilingual performance. Codes and datasets are available at this https URL.
- [649] arXiv:2406.14273 [pdf, html, other]
-
Title: The Impact of AI on Perceived Job Decency and Meaningfulness: A Case StudySubjects: Artificial Intelligence (cs.AI)
The proliferation of Artificial Intelligence (AI) in workplaces stands to change the way humans work, with job satisfaction intrinsically linked to work life. Existing research on human-AI collaboration tends to prioritize performance over the experiential aspects of work. In contrast, this paper explores the impact of AI on job decency and meaningfulness in workplaces. Through interviews in the Information Technology (IT) domain, we not only examined the current work environment, but also explored the perceived evolution of the workplace ecosystem with the introduction of an AI. Findings from the preliminary exploratory study reveal that respondents tend to visualize a workplace where humans continue to play a dominant role, even with the introduction of advanced AIs. In this prospective scenario, AI is seen as serving as a complement rather than replacing the human workforce. Furthermore, respondents believe that the introduction of AI will maintain or potentially increase overall job satisfaction.
- [650] arXiv:2406.14274 [pdf, html, other]
-
Title: Learning to Discover Knowledge: A Weakly-Supervised Partial Domain Adaptation ApproachComments: Accepted to TIP 2024. Code available: this https URLSubjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Domain adaptation has shown appealing performance by leveraging knowledge from a source domain with rich annotations. However, for a specific target task, it is cumbersome to collect related and high-quality source domains. In real-world scenarios, large-scale datasets corrupted with noisy labels are easy to collect, stimulating a great demand for automatic recognition in a generalized setting, i.e., weakly-supervised partial domain adaptation (WS-PDA), which transfers a classifier from a large source domain with noises in labels to a small unlabeled target domain. As such, the key issues of WS-PDA are: 1) how to sufficiently discover the knowledge from the noisy labeled source domain and the unlabeled target domain, and 2) how to successfully adapt the knowledge across domains. In this paper, we propose a simple yet effective domain adaptation approach, termed as self-paced transfer classifier learning (SP-TCL), to address the above issues, which could be regarded as a well-performing baseline for several generalized domain adaptation tasks. The proposed model is established upon the self-paced learning scheme, seeking a preferable classifier for the target domain. Specifically, SP-TCL learns to discover faithful knowledge via a carefully designed prudent loss function and simultaneously adapts the learned knowledge to the target domain by iteratively excluding source examples from training under the self-paced fashion. Extensive evaluations on several benchmark datasets demonstrate that SP-TCL significantly outperforms state-of-the-art approaches on several generalized domain adaptation tasks.
- [651] arXiv:2406.14275 [pdf, html, other]
-
Title: Step-Back Profiling: Distilling User History for Personalized Scientific WritingXiangru Tang, Xingyao Zhang, Yanjun Shao, Jie Wu, Yilun Zhao, Arman Cohan, Ming Gong, Dongmei Zhang, Mark GersteinSubjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Large language models (LLMs) excel at a variety of natural language processing tasks, yet they struggle to generate personalized content for individuals, particularly in real-world scenarios like scientific writing. Addressing this challenge, we introduce Step-Back Profiling to personalize LLMs by distilling user history into concise profiles, including essential traits and preferences of users. Regarding our experiments, we construct a Personalized Scientific Writing (PSW) dataset to study multiuser personalization. PSW requires the models to write scientific papers given specialized author groups with diverse academic backgrounds. As for the results, we demonstrate the effectiveness of capturing user characteristics via Step-Back Profiling for collaborative writing. Moreover, our approach outperforms the baselines by up to 3.6 points on the general personalization benchmark (LaMP), including 7 personalization LLM tasks. Our extensive ablation studies validate the contributions of different components in our method and provide insights into our task definition. Our dataset and code are available at \url{this https URL}.
- [652] arXiv:2406.14277 [pdf, html, other]
-
Title: Augmenting Query and Passage for Retrieval-Augmented Generation using LLMs for Open-Domain Question AnsweringSubjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Retrieval-augmented generation (RAG) has received much attention for Open-domain question-answering (ODQA) tasks as a means to compensate for the parametric knowledge of large language models (LLMs). While previous approaches focused on processing retrieved passages to remove irrelevant context, they still rely heavily on the quality of retrieved passages which can degrade if the question is ambiguous or complex. In this paper, we propose a simple yet efficient method called question and passage augmentation via LLMs for open-domain QA. Our method first decomposes the original questions into multiple-step sub-questions. By augmenting the original question with detailed sub-questions and planning, we are able to make the query more specific on what needs to be retrieved, improving the retrieval performance. In addition, to compensate for the case where the retrieved passages contain distracting information or divided opinions, we augment the retrieved passages with self-generated passages by LLMs to guide the answer extraction. Experimental results show that the proposed scheme outperforms the previous state-of-the-art and achieves significant performance gain over existing RAG methods.
- [653] arXiv:2406.14278 [pdf, html, other]
-
Title: Efficient Deterministic Algorithms for Maximizing Symmetric Submodular FunctionsSubjects: Data Structures and Algorithms (cs.DS)
Symmetric submodular maximization is an important class of combinatorial optimization problems, including MAX-CUT on graphs and hyper-graphs. The state-of-the-art algorithm for the problem over general constraints has an approximation ratio of $0.432$. The algorithm applies the canonical continuous greedy technique that involves a sampling process. It, therefore, suffers from high query complexity and is inherently randomized. In this paper, we present several efficient deterministic algorithms for maximizing a symmetric submodular function under various constraints. Specifically, for the cardinality constraint, we design a deterministic algorithm that attains a $0.432$ ratio and uses $O(kn)$ queries. Previously, the best deterministic algorithm attains a $0.385-\epsilon$ ratio and uses $O\left(kn (\frac{10}{9\epsilon})^{\frac{20}{9\epsilon}-1}\right)$ queries. For the matroid constraint, we design a deterministic algorithm that attains a $1/3-\epsilon$ ratio and uses $O(kn\log \epsilon^{-1})$ queries. Previously, the best deterministic algorithm can also attain $1/3-\epsilon$ ratio but it uses much larger $O(\epsilon^{-1}n^4)$ queries. For the packing constraints with a large width, we design a deterministic algorithm that attains a $0.432-\epsilon$ ratio and uses $O(n^2)$ queries. To the best of our knowledge, there is no deterministic algorithm for the constraint previously. The last algorithm can be adapted to attain a $0.432$ ratio for single knapsack constraint using $O(n^4)$ queries. Previously, the best deterministic algorithm attains a $0.316-\epsilon$ ratio and uses $\widetilde{O}(n^3)$ queries.
- [654] arXiv:2406.14281 [pdf, html, other]
-
Title: FairX: A comprehensive benchmarking tool for model analysis using fairness, utility, and explainabilitySubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
We present FairX, an open-source Python-based benchmarking tool designed for the comprehensive analysis of models under the umbrella of fairness, utility, and eXplainability (XAI). FairX enables users to train benchmarking bias-removal models and evaluate their fairness using a wide array of fairness metrics, data utility metrics, and generate explanations for model predictions, all within a unified framework. Existing benchmarking tools do not have the way to evaluate synthetic data generated from fair generative models, also they do not have the support for training fair generative models either. In FairX, we add fair generative models in the collection of our fair-model library (pre-processing, in-processing, post-processing) and evaluation metrics for evaluating the quality of synthetic fair data. This version of FairX supports both tabular and image datasets. It also allows users to provide their own custom datasets. The open-source FairX benchmarking package is publicly available at this https URL.
- [655] arXiv:2406.14282 [pdf, html, other]
-
Title: Learning to Plan for Retrieval-Augmented Large Language Models from Knowledge GraphsJunjie Wang, Mingyang Chen, Binbin Hu, Dan Yang, Ziqi Liu, Yue Shen, Peng Wei, Zhiqiang Zhang, Jinjie Gu, Jun Zhou, Jeff Z. Pan, Wen Zhang, Huajun ChenComments: Work in progressSubjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Improving the performance of large language models (LLMs) in complex question-answering (QA) scenarios has always been a research focal point. Recent studies have attempted to enhance LLMs' performance by combining step-wise planning with external retrieval. While effective for advanced models like GPT-3.5, smaller LLMs face challenges in decomposing complex questions, necessitating supervised fine-tuning. Previous work has relied on manual annotation and knowledge distillation from teacher LLMs, which are time-consuming and not accurate enough. In this paper, we introduce a novel framework for enhancing LLMs' planning capabilities by using planning data derived from knowledge graphs (KGs). LLMs fine-tuned with this data have improved planning capabilities, better equipping them to handle complex QA tasks that involve retrieval. Evaluations on multiple datasets, including our newly proposed benchmark, highlight the effectiveness of our framework and the benefits of KG-derived planning data.
- [656] arXiv:2406.14283 [pdf, html, other]
-
Title: Q*: Improving Multi-step Reasoning for LLMs with Deliberative PlanningSubjects: Artificial Intelligence (cs.AI)
Large Language Models (LLMs) have demonstrated impressive capability in many nature language tasks. However, the auto-regressive generation process makes LLMs prone to produce errors, hallucinations and inconsistent statements when performing multi-step reasoning. In this paper, we aim to alleviate the pathology by introducing Q*, a general, versatile and agile framework for guiding LLMs decoding process with deliberative planning. By learning a plug-and-play Q-value model as heuristic function, our Q* can effectively guide LLMs to select the most promising next step without fine-tuning LLMs for each task, which avoids the significant computational overhead and potential risk of performance degeneration on other tasks. Extensive experiments on GSM8K, MATH and MBPP confirm the superiority of our method.
- [657] arXiv:2406.14284 [pdf, other]
-
Title: VAIYAKARANA : A Benchmark for Automatic Grammar Correction in BanglaSubjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Bangla (Bengali) is the fifth most spoken language globally and, yet, the problem of automatic grammar correction in Bangla is still in its nascent stage. This is mostly due to the need for a large corpus of grammatically incorrect sentences, with their corresponding correct counterparts. The present state-of-the-art techniques to curate a corpus for grammatically wrong sentences involve random swapping, insertion and deletion of words. However,these steps may not always generate grammatically wrong sentences in Bangla. In this work, we propose a pragmatic approach to generate grammatically wrong sentences in Bangla. We first categorize the different kinds of errors in Bangla into 5 broad classes and 12 finer classes. We then use these to generate grammatically wrong sentences systematically from a correct sentence. This approach can generate a large number of wrong sentences and can, thus, mitigate the challenge of lacking a large corpus for neural networks. We provide a dataset, Vaiyakarana, consisting of 92,830 grammatically incorrect sentences as well as 18,426 correct sentences. We also collected 619 human-generated sentences from essays written by Bangla native speakers. This helped us to understand errors that are more frequent. We evaluated our corpus against neural models and LLMs and also benchmark it against human evaluators who are native speakers of Bangla. Our analysis shows that native speakers are far more accurate than state-of-the-art models to detect whether the sentence is grammatically correct. Our methodology of generating erroneous sentences can be applied for most other Indian languages as well.
- [658] arXiv:2406.14288 [pdf, html, other]
-
Title: Revisiting Modularity Maximization for Graph Clustering: A Contrastive Learning PerspectiveYunfei Liu, Jintang Li, Yuehe Chen, Ruofan Wu, Ericbk Wang, Jing Zhou, Sheng Tian, Shuheng Shen, Xing Fu, Changhua Meng, Weiqiang Wang, Liang ChenComments: KDD 2024 research track. Code available at this https URLSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Graph clustering, a fundamental and challenging task in graph mining, aims to classify nodes in a graph into several disjoint clusters. In recent years, graph contrastive learning (GCL) has emerged as a dominant line of research in graph clustering and advances the new state-of-the-art. However, GCL-based methods heavily rely on graph augmentations and contrastive schemes, which may potentially introduce challenges such as semantic drift and scalability issues. Another promising line of research involves the adoption of modularity maximization, a popular and effective measure for community detection, as the guiding principle for clustering tasks. Despite the recent progress, the underlying mechanism of modularity maximization is still not well understood. In this work, we dig into the hidden success of modularity maximization for graph clustering. Our analysis reveals the strong connections between modularity maximization and graph contrastive learning, where positive and negative examples are naturally defined by modularity. In light of our results, we propose a community-aware graph clustering framework, coined MAGI, which leverages modularity maximization as a contrastive pretext task to effectively uncover the underlying information of communities in graphs, while avoiding the problem of semantic drift. Extensive experiments on multiple graph datasets verify the effectiveness of MAGI in terms of scalability and clustering performance compared to state-of-the-art graph clustering methods. Notably, MAGI easily scales a sufficiently large graph with 100M nodes while outperforming strong baselines.
- [659] arXiv:2406.14290 [pdf, html, other]
-
Title: Examining the Implications of Deepfakes for Election IntegrityHriday Ranka, Mokshit Surana, Neel Kothari, Veer Pariawala, Pratyay Banerjee, Aditya Surve, Sainath Reddy Sankepally, Raghav Jain, Jhagrut Lalwani, Swapneel MehtaComments: Accepted at the AAAI 2024 conference, AI for Credible Elections Workshop-AI4CE 2024Subjects: Computers and Society (cs.CY); Social and Information Networks (cs.SI)
It is becoming cheaper to launch disinformation operations at scale using AI-generated content, in particular 'deepfake' technology. We have observed instances of deepfakes in political campaigns, where generated content is employed to both bolster the credibility of certain narratives (reinforcing outcomes) and manipulate public perception to the detriment of targeted candidates or causes (adversarial outcomes). We discuss the threats from deepfakes in politics, highlight model specifications underlying different types of deepfake generation methods, and contribute an accessible evaluation of the efficacy of existing detection methods. We provide this as a summary for lawmakers and civil society actors to understand how the technology may be applied in light of existing policies regulating its use. We highlight the limitations of existing detection mechanisms and discuss the areas where policies and regulations are required to address the challenges of deepfakes.
- [660] arXiv:2406.14294 [pdf, html, other]
-
Title: DASB -- Discrete Audio and Speech BenchmarkComments: 9 pages, 5 tablesSubjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Discrete audio tokens have recently gained considerable attention for their potential to connect audio and language processing, enabling the creation of modern multimodal large language models. Ideal audio tokens must effectively preserve phonetic and semantic content along with paralinguistic information, speaker identity, and other details. While several types of audio tokens have been recently proposed, identifying the optimal tokenizer for various tasks is challenging due to the inconsistent evaluation settings in existing studies. To address this gap, we release the Discrete Audio and Speech Benchmark (DASB), a comprehensive leaderboard for benchmarking discrete audio tokens across a wide range of discriminative tasks, including speech recognition, speaker identification and verification, emotion recognition, keyword spotting, and intent classification, as well as generative tasks such as speech enhancement, separation, and text-to-speech. Our results show that, on average, semantic tokens outperform compression tokens across most discriminative and generative tasks. However, the performance gap between semantic tokens and standard continuous representations remains substantial, highlighting the need for further research in this field.
- [661] arXiv:2406.14295 [pdf, other]
-
Title: Crowdfunding for Equitable EV Charging InfrastructureSubjects: Computational Engineering, Finance, and Science (cs.CE)
The transportation sector significantly contributes to greenhouse gas emissions, highlighting the need to transition to Electric Vehicles (EVs) to reduce fossil fuel dependence and combat climate change. The US government has set ambitious targets for 2030, aiming for half of all new vehicles sold to be zero-emissions. Expanding EV charging stations is crucial for this transition, but social equity presents a significant challenge. The Justice40 program mandates that at least 40% of benefits be allocated to disadvantaged communities, ensuring they benefit from federal investments. Given the current concentration of EV ownership in affluent areas, merely installing charging stations in disadvantaged neighborhoods may not suffice. This article explores crowdfunding as a novel method to finance EV charging infrastructure, engaging, and empowering underserved communities. The paper concludes with a hypothetical case showing financing benefits for disadvantaged communities, exploring crowdfunding variations, and scaling to develop equitable EV charging networks.
- [662] arXiv:2406.14297 [pdf, html, other]
-
Title: AI in Space for Scientific Missions: Strategies for Minimizing Neural-Network Model UploadJonah Ekelund, Ricardo Vinuesa, Yuri Khotyaintsev, Pierre Henri, Gian Luca Delzanno, Stefano MarkidisSubjects: Artificial Intelligence (cs.AI); Instrumentation and Methods for Astrophysics (astro-ph.IM)
Artificial Intelligence (AI) has the potential to revolutionize space exploration by delegating several spacecraft decisions to an onboard AI instead of relying on ground control and predefined procedures. It is likely that there will be an AI/ML Processing Unit onboard the spacecraft running an inference engine. The neural-network will have pre-installed parameters that can be updated onboard by uploading, by telecommands, parameters obtained by training on the ground. However, satellite uplinks have limited bandwidth and transmissions can be costly. Furthermore, a mission operating with a suboptimal neural network will miss out on valuable scientific data. Smaller networks can thereby decrease the uplink cost, while increasing the value of the scientific data that is downloaded. In this work, we evaluate and discuss the use of reduced-precision and bare-minimum neural networks to reduce the time for upload. As an example of an AI use case, we focus on the NASA's Magnetosperic MultiScale (MMS) mission. We show how an AI onboard could be used in the Earth's magnetosphere to classify data to selectively downlink higher value data or to recognize a region-of-interest to trigger a burst-mode, collecting data at a high-rate. Using a simple filtering scheme and algorithm, we show how the start and end of a region-of-interest can be detected in on a stream of classifications. To provide the classifications, we use an established Convolutional Neural Network (CNN) trained to an accuracy >94%. We also show how the network can be reduced to a single linear layer and trained to the same accuracy as the established CNN. Thereby, reducing the overall size of the model by up to 98.9%. We further show how each network can be reduced by up to 75% of its original size, by using lower-precision formats to represent the network parameters, with a change in accuracy of less than 0.6 percentage points.
- [663] arXiv:2406.14301 [pdf, html, other]
-
Title: Resource Optimization for Tail-Based Control in Wireless Networked Control SystemsComments: Accepted in PIMRC 2024 conference, 6 pages, 5 figuresSubjects: Systems and Control (eess.SY); Machine Learning (cs.LG)
Achieving control stability is one of the key design challenges of scalable Wireless Networked Control Systems (WNCS) under limited communication and computing resources. This paper explores the use of an alternative control concept defined as tail-based control, which extends the classical Linear Quadratic Regulator (LQR) cost function for multiple dynamic control systems over a shared wireless network. We cast the control of multiple control systems as a network-wide optimization problem and decouple it in terms of sensor scheduling, plant state prediction, and control policies. Toward this, we propose a solution consisting of a scheduling algorithm based on Lyapunov optimization for sensing, a mechanism based on Gaussian Process Regression (GPR) for state prediction and uncertainty estimation, and a control policy based on Reinforcement Learning (RL) to ensure tail-based control stability. A set of discrete time-invariant mountain car control systems is used to evaluate the proposed solution and is compared against four variants that use state-of-the-art scheduling, prediction, and control methods. The experimental results indicate that the proposed method yields 22% reduction in overall cost in terms of communication and control resource utilization compared to state-of-the-art methods.
- [664] arXiv:2406.14304 [pdf, html, other]
-
Title: A Variational Characterization of $H$-Mutual Information and its Application to Computing $H$-CapacitySubjects: Information Theory (cs.IT)
$H$-mutual information ($H$-MI) is a wide class of information leakage measures, where $H=(\eta, F)$ is a pair of monotonically increasing function $\eta$ and a concave function $F$, which is a generalization of Shannon entropy. $H$-MI is defined as the difference between the generalized entropy $H$ and its conditional version, including Shannon mutual information (MI), Arimoto MI of order $\alpha$, $g$-leakage, and expected value of sample information. This study presents a variational characterization of $H$-MI via statistical decision theory. Based on the characterization, we propose an alternating optimization algorithm for computing $H$-capacity.
- [665] arXiv:2406.14305 [pdf, html, other]
-
Title: Disproving Termination of Non-Erasing Sole Combinatory Calculus with Tree Automata (Full Version)Comments: This is the full version of the corresponding CIAA 2024 paperSubjects: Logic in Computer Science (cs.LO); Formal Languages and Automata Theory (cs.FL)
We study the termination of sole combinatory calculus, which consists of only one combinator. Specifically, the termination for non-erasing combinators is disproven by finding a desirable tree automaton with a SAT solver as done for term rewriting systems by Endrullis and Zantema. We improved their technique to apply to non-erasing sole combinatory calculus, in which it suffices to search for tree automata with a final sink state. Our method succeeds in disproving the termination of 8 combinators, whose termination has been an open problem.
- [666] arXiv:2406.14309 [pdf, other]
-
Title: Emerging-properties Mapping Using Spatial Embedding Statistics: EMUSESChris Foulon, Marcela Ovando-Tellez, Lia Talozzi, Maurizio Corbetta, Anna Matsulevits, Michel Thiebaut de SchottenComments: 14 pages, 3 figuresSubjects: Emerging Technologies (cs.ET); Machine Learning (cs.LG)
Understanding complex phenomena often requires analyzing high-dimensional data to uncover emergent properties that arise from multifactorial interactions. Here, we present EMUSES (Emerging-properties Mapping Using Spatial Embedding Statistics), an innovative approach employing Uniform Manifold Approximation and Projection (UMAP) to create high-dimensional embeddings that reveal latent structures within data. EMUSES facilitates the exploration and prediction of emergent properties by statistically analyzing these latent spaces. Using three distinct datasets--a handwritten digits dataset from the National Institute of Standards and Technology (NIST, E. Alpaydin, 1998), the Chicago Face Database (Ma et al., 2015), and brain disconnection data post-stroke (Talozzi et al., 2023)--we demonstrate EMUSES' effectiveness in detecting and interpreting emergent properties. Our method not only predicts outcomes with high accuracy but also provides clear visualizations and statistical insights into the underlying interactions within the data. By bridging the gap between predictive accuracy and interpretability, EMUSES offers researchers a powerful tool to understand the multifactorial origins of complex phenomena.
- [667] arXiv:2406.14310 [pdf, html, other]
-
Title: Cross-level Requirement Traceability: A Novel Approach Integrating Bag-of-Words and Word Embedding for Enhanced Similarity FunctionalitySubjects: Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Requirement traceability is the process of identifying the inter-dependencies between requirements. It poses a significant challenge when conducted manually, especially when dealing with requirements at various levels of abstraction. In this work, we propose a novel approach to automate the task of linking high-level business requirements with more technical system requirements. The proposed approach begins by representing each requirement using a Bag of-Words (BOW) model combined with the Term Frequency-Inverse Document Frequency (TF-IDF) scoring function. Then, we suggested an enhanced cosine similarity that uses recent advances in word embedding representation to correct traditional cosine similarity function limitations. To evaluate the effectiveness of our approach, we conducted experiments on three well-known datasets: COEST, WARC(NFR), and WARC(FRS). The results demonstrate that our approach significantly improves efficiency compared to existing methods. We achieved better results with an increase of approximately 18.4% in one of the datasets, as measured by the F2 score.
- [668] arXiv:2406.14312 [pdf, html, other]
-
Title: Infusing clinical knowledge into tokenisers for language modelsAbul Hasan, Jinge Wu, Quang Ngoc Nguyen, Salomé Andres, Imane Guellil, Huayu Zhang, Arlene Casey, Beatrice Alex, Bruce Guthrie, Honghan WuComments: 18 pages, 6 figuresSubjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
This study introduces a novel knowledge enhanced tokenisation mechanism, K-Tokeniser, for clinical text processing. Technically, at initialisation stage, K-Tokeniser populates global representations of tokens based on semantic types of domain concepts (such as drugs or diseases) from either a domain ontology like Unified Medical Language System or the training data of the task related corpus. At training or inference stage, sentence level localised context will be utilised for choosing the optimal global token representation to realise the semantic-based tokenisation. To avoid pretraining using the new tokeniser, an embedding initialisation approach is proposed to generate representations for new tokens. Using three transformer-based language models, a comprehensive set of experiments are conducted on four real-world datasets for evaluating K-Tokeniser in a wide range of clinical text analytics tasks including clinical concept and relation extraction, automated clinical coding, clinical phenotype identification, and clinical research article classification. Overall, our models demonstrate consistent improvements over their counterparts in all tasks. In particular, substantial improvements are observed in the automated clinical coding task with 13\% increase on Micro $F_1$ score. Furthermore, K-Tokeniser also shows significant capacities in facilitating quicker converge of language models. Specifically, using K-Tokeniser, the language models would only require 50\% of the training data to achieve the best performance of the baseline tokeniser using all training data in the concept extraction task and less than 20\% of the data for the automated coding task. It is worth mentioning that all these improvements require no pre-training process, making the approach generalisable.
- [669] arXiv:2406.14313 [pdf, html, other]
-
Title: Robust Few-shot Transfer Learning for Knowledge Base Question Answering with Unanswerable QuestionsSubjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Real-world KBQA applications require models that are (1) robust -- e.g., can differentiate between answerable and unanswerable questions, and (2) low-resource -- do not require large training data. Towards this goal, we propose the novel task of few-shot transfer for KBQA with unanswerable questions. We present FUn-FuSIC that extends the state-of-the-art (SoTA) few-shot transfer model for answerable-only KBQA to handle unanswerability. It iteratively prompts an LLM to generate logical forms for the question by providing feedback using a diverse suite of syntactic, semantic and execution guided checks, and adapts self-consistency to assess confidence of the LLM to decide answerability. Experiments over newly constructed datasets show that FUn-FuSIC outperforms suitable adaptations of the SoTA model for KBQA with unanswerability, and the SoTA model for answerable-only few-shot-transfer KBQA.
- [670] arXiv:2406.14314 [pdf, html, other]
-
Title: Identifying User Goals from UI TrajectoriesSubjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Autonomous agents that interact with graphical user interfaces (GUIs) hold significant potential for enhancing user experiences. To further improve these experiences, agents need to be personalized and proactive. By effectively comprehending user intentions through their actions and interactions with GUIs, agents will be better positioned to achieve these goals. This paper introduces the task of goal identification from observed UI trajectories, aiming to infer the user's intended task based on their GUI interactions. We propose a novel evaluation metric to assess whether two task descriptions are paraphrases within a specific UI environment. By Leveraging the inverse relation with the UI automation task, we utilized the Android-In-The-Wild and Mind2Web datasets for our experiments. Using our metric and these datasets, we conducted several experiments comparing the performance of humans and state-of-the-art models, specifically GPT-4 and Gemini-1.5 Pro. Our results show that Gemini performs better than GPT but still underperforms compared to humans, indicating significant room for improvement.
- [671] arXiv:2406.14315 [pdf, html, other]
-
Title: AI-coupled HPC Workflow Applications, Middleware and PerformanceSubjects: Distributed, Parallel, and Cluster Computing (cs.DC)
AI integration is revolutionizing the landscape of HPC simulations, enhancing the importance, use, and performance of AI-driven HPC workflows. This paper surveys the diverse and rapidly evolving field of AI-driven HPC and provides a common conceptual basis for understanding AI-driven HPC workflows. Specifically, we use insights from different modes of coupling AI into HPC workflows to propose six execution motifs most commonly found in scientific applications. The proposed set of execution motifs is by definition incomplete and evolving. However, they allow us to analyze the primary performance challenges underpinning AI-driven HPC workflows. We close with a listing of open challenges, research issues, and suggested areas of investigation including the the need for specific benchmarks that will help evaluate and improve the execution of AI-driven HPC workflows.
- [672] arXiv:2406.14317 [pdf, html, other]
-
Title: Maximum principle preserving time implicit DGSEM for nonlinear scalar conservation lawsComments: 25 pages, 15 figures, 1 table, 55 referencesSubjects: Numerical Analysis (math.NA)
This work concerns the analysis of the discontinuous Galerkin spectral element method (DGSEM) with implicit time stepping for the numerical approximation of nonlinear scalar conservation laws in multiple space dimensions. We consider either the DGSEM with a backward Euler time stepping, or a space-time DGSEM discretization to remove the restriction on the time step. We design graph viscosities in space, and in time for the space-time DGSEM, to make the schemes maximum principle preserving and entropy stable for every admissible convex entropy. We also establish well-posedness of the discrete problems by showing existence and uniqueness of the solutions to the nonlinear implicit algebraic relations that need to be solved at each time step. Numerical experiments in one space dimension are presented to illustrate the properties of these schemes.
- [673] arXiv:2406.14318 [pdf, html, other]
-
Title: The Fire Thief Is Also the Keeper: Balancing Usability and Privacy in PromptsSubjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
The rapid adoption of online chatbots represents a significant advancement in artificial intelligence. However, this convenience brings considerable privacy concerns, as prompts can inadvertently contain sensitive information exposed to large language models (LLMs). Limited by high computational costs, reduced task usability, and excessive system modifications, previous works based on local deployment, embedding perturbation, and homomorphic encryption are inapplicable to online prompt-based LLM applications.
To address these issues, this paper introduces Prompt Privacy Sanitizer (i.e., ProSan), an end-to-end prompt privacy protection framework that can produce anonymized prompts with contextual privacy removed while maintaining task usability and human readability. It can also be seamlessly integrated into the online LLM service pipeline. To achieve high usability and dynamic anonymity, ProSan flexibly adjusts its protection targets and strength based on the importance of the words and the privacy leakage risk of the prompts. Additionally, ProSan is capable of adapting to diverse computational resource conditions, ensuring privacy protection even for mobile devices with limited computing power. Our experiments demonstrate that ProSan effectively removes private information across various tasks, including question answering, text summarization, and code generation, with minimal reduction in task performance. - [674] arXiv:2406.14319 [pdf, html, other]
-
Title: LiveMind: Low-latency Large Language Models with Simultaneous InferenceSubjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
In this paper, we introduce a novel low-latency inference framework for large language models (LLMs) inference which enables LLMs to perform inferences with incomplete prompts. By reallocating computational processes to prompt input phase, we achieve a substantial reduction in latency, thereby significantly enhancing the interactive experience for users of LLMs. The framework adeptly manages the visibility of the streaming prompt to the model, allowing it to infer from incomplete prompts or await additional prompts. Compared with traditional inference methods that utilize complete prompts, our approach demonstrates an average reduction of 59% in response latency on the MMLU-Pro dataset, while maintaining comparable accuracy. Additionally, our framework facilitates collaborative inference and output across different models. By employing an LLM for inference and a small language model (SLM) for output, we achieve an average 68% reduction in response latency, alongside a 5.5% improvement in accuracy on the MMLU-Pro dataset compared with the SLM baseline. For long prompts exceeding 20 sentences, the response latency can be reduced by up to 93%.
- [675] arXiv:2406.14322 [pdf, html, other]
-
Title: Mind the Privacy Unit! User-Level Differential Privacy for Language Model Fine-TuningLynn Chua, Badih Ghazi, Yangsibo Huang, Pritish Kamath, Daogao Liu, Pasin Manurangsi, Amer Sinha, Chiyuan ZhangSubjects: Computation and Language (cs.CL); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Large language models (LLMs) have emerged as powerful tools for tackling complex tasks across diverse domains, but they also raise privacy concerns when fine-tuned on sensitive data due to potential memorization. While differential privacy (DP) offers a promising solution by ensuring models are `almost indistinguishable' with or without any particular privacy unit, current evaluations on LLMs mostly treat each example (text record) as the privacy unit. This leads to uneven user privacy guarantees when contributions per user vary. We therefore study user-level DP motivated by applications where it necessary to ensure uniform privacy protection across users. We present a systematic evaluation of user-level DP for LLM fine-tuning on natural language generation tasks. Focusing on two mechanisms for achieving user-level DP guarantees, Group Privacy and User-wise DP-SGD, we investigate design choices like data selection strategies and parameter tuning for the best privacy-utility tradeoff.
- [676] arXiv:2406.14324 [pdf, html, other]
-
Title: Revealing the learning process in reinforcement learning agents through attention-oriented metricsSubjects: Machine Learning (cs.LG)
The learning process of a reinforcement learning (RL) agent remains poorly understood beyond the mathematical formulation of its learning algorithm. To address this gap, we introduce attention-oriented metrics (ATOMs) to investigate the development of an RL agent's attention during training. We tested ATOMs on three variations of a Pong game, each designed to teach the agent distinct behaviours, complemented by a behavioural assessment. Our findings reveal that ATOMs successfully delineate the attention patterns of an agent trained on each game variation, and that these differences in attention patterns translate into differences in the agent's behaviour. Through continuous monitoring of ATOMs during training, we observed that the agent's attention developed in phases, and that these phases were consistent across games. Finally, we noted that the agent's attention to its paddle emerged relatively late in the training and coincided with a marked increase in its performance score. Overall, we believe that ATOMs could significantly enhance our understanding of RL agents' learning processes, which is essential for improving their reliability and efficiency.
- [677] arXiv:2406.14325 [pdf, html, other]
-
Title: Reproducibility in Machine Learning-based Research: Overview, Barriers and DriversHarald Semmelrock, Tony Ross-Hellauer, Simone Kopeinik, Dieter Theiler, Armin Haberl, Stefan Thalmann, Dominik KowaldComments: Pre-print of submission planned to the AI MagazineSubjects: Software Engineering (cs.SE); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Research in various fields is currently experiencing challenges regarding the reproducibility of results. This problem is also prevalent in machine learning (ML) research. The issue arises primarily due to unpublished data and/or source code and the sensitivity of ML training conditions. Although different solutions have been proposed to address this issue, such as using ML platforms, the level of reproducibility in ML-driven research remains unsatisfactory. Therefore, in this article, we discuss the reproducibility of ML-driven research with three main aims: (i) identify the barriers to reproducibility when applying ML in research as well as categorize the barriers to different types of reproducibility (description, code, data, and experiment reproducibility), (ii) identify potential drivers such as tools, practices, and interventions that support ML reproducibility as well as distinguish between technology-driven drivers, procedural drivers, and drivers related to awareness and education, and (iii) map the drivers to the barriers. With this work, we hope to provide insights and contribute to the decision-making process regarding the adoption of different solutions to support ML reproducibility.
- [678] arXiv:2406.14326 [pdf, html, other]
-
Title: medIKAL: Integrating Knowledge Graphs as Assistants of LLMs for Enhanced Clinical Diagnosis on EMRsSubjects: Computation and Language (cs.CL)
Electronic Medical Records (EMRs), while integral to modern healthcare, present challenges for clinical reasoning and diagnosis due to their complexity and information redundancy. To address this, we proposed medIKAL (Integrating Knowledge Graphs as Assistants of LLMs), a framework that combines Large Language Models (LLMs) with knowledge graphs (KGs) to enhance diagnostic capabilities. medIKAL assigns weighted importance to entities in medical records based on their type, enabling precise localization of candidate diseases within KGs. It innovatively employs a residual network-like approach, allowing initial diagnosis by the LLM to be merged into KG search results. Through a path-based reranking algorithm and a fill-in-the-blank style prompt template, it further refined the diagnostic process. We validated medIKAL's effectiveness through extensive experiments on a newly introduced open-sourced Chinese EMR dataset, demonstrating its potential to improve clinical diagnosis in real-world settings.
- [679] arXiv:2406.14328 [pdf, html, other]
-
Title: Computing Within Limits: An Empirical Study of Energy Consumption in ML Training and InferenceComments: Accepted for publication at ARISDE2024: 1st International Workshop on Artificial Intelligence for Sustainable DevelopmentSubjects: Machine Learning (cs.LG)
Machine learning (ML) has seen tremendous advancements, but its environmental footprint remains a concern. Acknowledging the growing environmental impact of ML this paper investigates Green ML, examining various model architectures and hyperparameters in both training and inference phases to identify energy-efficient practices. Our study leverages software-based power measurements for ease of replication across diverse configurations, models and datasets. In this paper, we examine multiple models and hardware configurations to identify correlations across the various measurements and metrics and key contributors to energy reduction. Our analysis offers practical guidelines for constructing sustainable ML operations, emphasising energy consumption and carbon footprint reductions while maintaining performance. As identified, short-lived profiling can quantify the long-term expected energy consumption. Moreover, model parameters can also be used to accurately estimate the expected total energy without the need for extensive experimentation.
- [680] arXiv:2406.14329 [pdf, html, other]
-
Title: Adaptive Adversarial Cross-Entropy Loss for Sharpness-Aware MinimizationComments: Accepted in ICIP2024. The project page can be accessed at this http URLSubjects: Machine Learning (cs.LG); Image and Video Processing (eess.IV)
Recent advancements in learning algorithms have demonstrated that the sharpness of the loss surface is an effective measure for improving the generalization gap. Building upon this concept, Sharpness-Aware Minimization (SAM) was proposed to enhance model generalization and achieved state-of-the-art performance. SAM consists of two main steps, the weight perturbation step and the weight updating step. However, the perturbation in SAM is determined by only the gradient of the training loss, or cross-entropy loss. As the model approaches a stationary point, this gradient becomes small and oscillates, leading to inconsistent perturbation directions and also has a chance of diminishing the gradient. Our research introduces an innovative approach to further enhancing model generalization. We propose the Adaptive Adversarial Cross-Entropy (AACE) loss function to replace standard cross-entropy loss for SAM's perturbation. AACE loss and its gradient uniquely increase as the model nears convergence, ensuring consistent perturbation direction and addressing the gradient diminishing issue. Additionally, a novel perturbation-generating function utilizing AACE loss without normalization is proposed, enhancing the model's exploratory capabilities in near-optimum stages. Empirical testing confirms the effectiveness of AACE, with experiments demonstrating improved performance in image classification tasks using Wide ResNet and PyramidNet across various datasets. The reproduction code is available online
- [681] arXiv:2406.14333 [pdf, html, other]
-
Title: LARP: Language Audio Relational Pre-training for Cold-Start Playlist ContinuationSubjects: Information Retrieval (cs.IR); Sound (cs.SD); Audio and Speech Processing (eess.AS)
As online music consumption increasingly shifts towards playlist-based listening, the task of playlist continuation, in which an algorithm suggests songs to extend a playlist in a personalized and musically cohesive manner, has become vital to the success of music streaming. Currently, many existing playlist continuation approaches rely on collaborative filtering methods to perform recommendation. However, such methods will struggle to recommend songs that lack interaction data, an issue known as the cold-start problem. Current approaches to this challenge design complex mechanisms for extracting relational signals from sparse collaborative data and integrating them into content representations. However, these approaches leave content representation learning out of scope and utilize frozen, pre-trained content models that may not be aligned with the distribution or format of a specific musical setting. Furthermore, even the musical state-of-the-art content modules are either (1) incompatible with the cold-start setting or (2) unable to effectively integrate cross-modal and relational signals. In this paper, we introduce LARP, a multi-modal cold-start playlist continuation model, to effectively overcome these limitations. LARP is a three-stage contrastive learning framework that integrates both multi-modal and relational signals into its learned representations. Our framework uses increasing stages of task-specific abstraction: within-track (language-audio) contrastive loss, track-track contrastive loss, and track-playlist contrastive loss. Experimental results on two publicly available datasets demonstrate the efficacy of LARP over uni-modal and multi-modal models for playlist continuation in a cold-start setting. Code and dataset are released at: this https URL.
- [682] arXiv:2406.14335 [pdf, html, other]
-
Title: Self-supervised Interpretable Concept-based Models for Text ClassificationFrancesco De Santis, Philippe Bich, Gabriele Ciravegna, Pietro Barbiero, Danilo Giordano, Tania CerquitelliSubjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Despite their success, Large-Language Models (LLMs) still face criticism as their lack of interpretability limits their controllability and reliability. Traditional post-hoc interpretation methods, based on attention and gradient-based analysis, offer limited insight into the model's decision-making processes. In the image field, Concept-based models have emerged as explainable-by-design architectures, employing human-interpretable features as intermediate representations. However, these methods have not been yet adapted to textual data, mainly because they require expensive concept annotations, which are impractical for real-world text data. This paper addresses this challenge by proposing a self-supervised Interpretable Concept Embedding Models (ICEMs). We leverage the generalization abilities of LLMs to predict the concepts labels in a self-supervised way, while we deliver the final predictions with an interpretable function. The results of our experiments show that ICEMs can be trained in a self-supervised way achieving similar performance to fully supervised concept-based models and end-to-end black-box ones. Additionally, we show that our models are (i) interpretable, offering meaningful logical explanations for their predictions; (ii) interactable, allowing humans to modify intermediate predictions through concept interventions; and (iii) controllable, guiding the LLMs' decoding process to follow a required decision-making path.
- [683] arXiv:2406.14336 [pdf, html, other]
-
Title: Exploring Spatial Representations in the Historical Lake District Texts with LLM-based Relation ExtractionSubjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Navigating historical narratives poses a challenge in unveiling the spatial intricacies of past landscapes. The proposed work addresses this challenge within the context of the English Lake District, employing the Corpus of the Lake District Writing. The method utilizes a generative pre-trained transformer model to extract spatial relations from the textual descriptions in the corpus. The study applies this large language model to understand the spatial dimensions inherent in historical narratives comprehensively. The outcomes are presented as semantic triples, capturing the nuanced connections between entities and locations, and visualized as a network, offering a graphical representation of the spatial narrative. The study contributes to a deeper comprehension of the English Lake District's spatial tapestry and provides an approach to uncovering spatial relations within diverse historical contexts.
- [684] arXiv:2406.14338 [pdf, html, other]
-
Title: Adaptive Robust Controller for handling Unknown Uncertainty of Robotic ManipulatorsSubjects: Robotics (cs.RO); Systems and Control (eess.SY)
The ability to achieve precise and smooth trajectory tracking is crucial for ensuring the successful execution of various tasks involving robotic manipulators. State-of-the-art techniques require accurate mathematical models of the robot dynamics, and robustness to model uncertainties is achieved by relying on precise bounds on the model mismatch. In this paper, we propose a novel adaptive robust feedback linearization scheme able to compensate for model uncertainties without any a-priori knowledge on them, and we provide a theoretical proof of convergence under mild assumptions. We evaluate the method on a simulated RR robot. First, we consider a nominal model with known model mismatch, which allows us to compare our strategy with state-of-the-art uncertainty-aware methods. Second, we implement the proposed control law in combination with a learned model, for which uncertainty bounds are not available. Results show that our method leads to performance comparable to uncertainty-aware methods while requiring less prior knowledge.
- [685] arXiv:2406.14341 [pdf, html, other]
-
Title: HoTPP Benchmark: Are We Good at the Long Horizon Events Forecasting?Subjects: Machine Learning (cs.LG)
In sequential event prediction, which finds applications in finance, retail, social networks, and healthcare, a crucial task is forecasting multiple future events within a specified time horizon. Traditionally, this has been addressed through autoregressive generation using next-event prediction models, such as Marked Temporal Point Processes. However, autoregressive methods use their own output for future predictions, potentially reducing quality as the prediction horizon extends. In this paper, we challenge traditional approaches by introducing a novel benchmark, HoTPP, specifically designed to evaluate a model's ability to predict event sequences over a horizon. This benchmark features a new metric inspired by object detection in computer vision, addressing the limitations of existing metrics in assessing models with imprecise time-step predictions. Our evaluations on established datasets employing various models demonstrate that high accuracy in next-event prediction does not necessarily translate to superior horizon prediction, and vice versa. HoTPP aims to serve as a valuable tool for developing more robust event sequence prediction methods, ultimately paving the way for further advancements in the field.
- [686] arXiv:2406.14343 [pdf, html, other]
-
Title: iWISDM: Assessing instruction following in multimodal models at scaleSubjects: Artificial Intelligence (cs.AI)
The ability to perform complex tasks from detailed instructions is a key to many remarkable achievements of our species. As humans, we are not only capable of performing a wide variety of tasks but also very complex ones that may entail hundreds or thousands of steps to complete. Large language models and their more recent multimodal counterparts that integrate textual and visual inputs have achieved unprecedented success in performing complex tasks. Yet, most existing benchmarks are largely confined to single-modality inputs (either text or vision), narrowing the scope of multimodal assessments, particularly for instruction-following in multimodal contexts. To bridge this gap, we introduce the instructed-Virtual VISual Decision Making (iWISDM) environment engineered to generate a limitless array of vision-language tasks of varying complexity. Using iWISDM, we compiled three distinct benchmarks of instruction following visual tasks across varying complexity levels and evaluated several newly developed multimodal models on these benchmarks. Our findings establish iWISDM as a robust benchmark for assessing the instructional adherence of both existing and emergent multimodal models and highlight a large gap between these models' ability to precisely follow instructions with that of humans.
- [687] arXiv:2406.14349 [pdf, html, other]
-
Title: Can you trust your explanations? A robustness test for feature attribution methodsComments: 8 pages, 3 figuresSubjects: Machine Learning (cs.LG)
The increase of legislative concerns towards the usage of Artificial Intelligence (AI) has recently led to a series of regulations striving for a more transparent, trustworthy and accountable AI. Along with these proposals, the field of Explainable AI (XAI) has seen a rapid growth but the usage of its techniques has at times led to unexpected results. The robustness of the approaches is, in fact, a key property often overlooked: it is necessary to evaluate the stability of an explanation (to random and adversarial perturbations) to ensure that the results are trustable. To this end, we propose a test to evaluate the robustness to non-adversarial perturbations and an ensemble approach to analyse more in depth the robustness of XAI methods applied to neural networks and tabular datasets. We will show how leveraging manifold hypothesis and ensemble approaches can be beneficial to an in-depth analysis of the robustness.
- [688] arXiv:2406.14359 [pdf, html, other]
-
Title: Learning to Transfer for Evolutionary MultitaskingComments: Under reviewSubjects: Neural and Evolutionary Computing (cs.NE)
Evolutionary multitasking (EMT) is an emerging approach for solving multitask optimization problems (MTOPs) and has garnered considerable research interest. The implicit EMT is a significant research branch that utilizes evolution operators to enable knowledge transfer (KT) between tasks. However, current approaches in implicit EMT face challenges in adaptability, due to the use of a limited number of evolution operators and insufficient utilization of evolutionary states for performing KT. This results in suboptimal exploitation of implicit KT's potential to tackle a variety of MTOPs. To overcome these limitations, we propose a novel Learning to Transfer (L2T) framework to automatically discover efficient KT policies for the MTOPs at hand. Our framework conceptualizes the KT process as a learning agent's sequence of strategic decisions within the EMT process. We propose an action formulation for deciding when and how to transfer, a state representation with informative features of evolution states, a reward formulation concerning convergence and transfer efficiency gain, and the environment for the agent to interact with MTOPs. We employ an actor-critic network structure for the agent and learn it via proximal policy optimization. This learned agent can be integrated with various evolutionary algorithms, enhancing their ability to address a range of new MTOPs. Comprehensive empirical studies on both synthetic and real-world MTOPs, encompassing diverse inter-task relationships, function classes, and task distributions are conducted to validate the proposed L2T framework. The results show a marked improvement in the adaptability and performance of implicit EMT when solving a wide spectrum of unseen MTOPs.
- [689] arXiv:2406.14360 [pdf, html, other]
-
Title: Deblurring Neural Radiance Fields with Event-driven Bundle AdjustmentSubjects: Computer Vision and Pattern Recognition (cs.CV)
Neural Radiance Fields (NeRF) achieve impressive 3D representation learning and novel view synthesis results with high-quality multi-view images as input. However, motion blur in images often occurs in low-light and high-speed motion scenes, which significantly degrade the reconstruction quality of NeRF. Previous deblurring NeRF methods are struggling to estimate information during the exposure time, unable to accurately model the motion blur. In contrast, the bio-inspired event camera measuring intensity changes with high temporal resolution makes up this information deficiency. In this paper, we propose Event-driven Bundle Adjustment for Deblurring Neural Radiance Fields (EBAD-NeRF) to jointly optimize the learnable poses and NeRF parameters by leveraging the hybrid event-RGB data. An intensity-change-metric event loss and a photo-metric blur loss are introduced to strengthen the explicit modeling of camera motion blur. Experiment results on both synthetic data and real captured data demonstrate that EBAD-NeRF can obtain accurate camera poses during the exposure time and learn sharper 3D representations compared to prior works.
- [690] arXiv:2406.14361 [pdf, html, other]
-
Title: Robustness Analysis of AI Models in Critical Energy SystemsSubjects: Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
This paper analyzes the robustness of state-of-the-art AI-based models for power grid operations under the $N-1$ security criterion. While these models perform well in regular grid settings, our results highlight a significant loss in accuracy following the disconnection of a line.%under this security criterion. Using graph theory-based analysis, we demonstrate the impact of node connectivity on this loss. Our findings emphasize the need for practical scenario considerations in developing AI methodologies for critical infrastructure.
- [691] arXiv:2406.14362 [pdf, other]
-
Title: Communication-Efficient Byzantine-Resilient Federated Zero-Order OptimizationSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
We introduce CYBER-0, the first zero-order optimization algorithm for memory-and-communication efficient Federated Learning, resilient to Byzantine faults. We show through extensive numerical experiments on the MNIST dataset and finetuning RoBERTa-Large that CYBER-0 outperforms state-of-the-art algorithms in terms of communication and memory efficiency while reaching similar accuracy. We provide theoretical guarantees on its convergence for convex loss functions.
- [692] arXiv:2406.14365 [pdf, html, other]
-
Title: Mask the Unknown: Assessing Different Strategies to Handle Weak Annotations in the MICCAI2023 Mediastinal Lymph Node Quantification ChallengeComments: Accepted for publication at the Journal of Machine Learning for Biomedical Imaging (MELBA) this https URLJournal-ref: Machine.Learning.for.Biomedical.Imaging. 2 (2024)Subjects: Computer Vision and Pattern Recognition (cs.CV)
Pathological lymph node delineation is crucial in cancer diagnosis, progression assessment, and treatment planning. The MICCAI 2023 Lymph Node Quantification Challenge published the first public dataset for pathological lymph node segmentation in the mediastinum. As lymph node annotations are expensive, the challenge was formed as a weakly supervised learning task, where only a subset of all lymph nodes in the training set have been annotated. For the challenge submission, multiple methods for training on these weakly supervised data were explored, including noisy label training, loss masking of unlabeled data, and an approach that integrated the TotalSegmentator toolbox as a form of pseudo labeling in order to reduce the number of unknown voxels. Furthermore, multiple public TCIA datasets were incorporated into the training to improve the performance of the deep learning model. Our submitted model achieved a Dice score of 0.628 and an average symmetric surface distance of 5.8~mm on the challenge test set. With our submitted model, we accomplished third rank in the MICCAI2023 LNQ challenge. A finding of our analysis was that the integration of all visible, including non-pathological, lymph nodes improved the overall segmentation performance on pathological lymph nodes of the test set. Furthermore, segmentation models trained only on clinically enlarged lymph nodes, as given in the challenge scenario, could not generalize to smaller pathological lymph nodes. The code and model for the challenge submission are available at \url{this https URL}.
- [693] arXiv:2406.14367 [pdf, html, other]
-
Title: PoseBench: Benchmarking the Robustness of Pose Estimation Models under CorruptionsComments: Technical report. Project page: this https URLSubjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Pose estimation aims to accurately identify anatomical keypoints in humans and animals using monocular images, which is crucial for various applications such as human-machine interaction, embodied AI, and autonomous driving. While current models show promising results, they are typically trained and tested on clean data, potentially overlooking the corruption during real-world deployment and thus posing safety risks in practical scenarios. To address this issue, we introduce PoseBench, a comprehensive benchmark designed to evaluate the robustness of pose estimation models against real-world corruption. We evaluated 60 representative models, including top-down, bottom-up, heatmap-based, regression-based, and classification-based methods, across three datasets for human and animal pose estimation. Our evaluation involves 10 types of corruption in four categories: 1) blur and noise, 2) compression and color loss, 3) severe lighting, and 4) masks. Our findings reveal that state-of-the-art models are vulnerable to common real-world corruptions and exhibit distinct behaviors when tackling human and animal pose estimation tasks. To improve model robustness, we delve into various design considerations, including input resolution, pre-training datasets, backbone capacity, post-processing, and data augmentations. We hope that our benchmark will serve as a foundation for advancing research in robust pose estimation. The benchmark and source code will be released at this https URL
- [694] arXiv:2406.14370 [pdf, html, other]
-
Title: Enhanced Bank Check Security: Introducing a Novel Dataset and Transformer-Based Approach for Detection and VerificationComments: Accepted for publication in 16th IAPR International Workshop on Document Analysis Systems 2024Subjects: Computer Vision and Pattern Recognition (cs.CV)
Automated signature verification on bank checks is critical for fraud prevention and ensuring transaction authenticity. This task is challenging due to the coexistence of signatures with other textual and graphical elements on real-world documents. Verification systems must first detect the signature and then validate its authenticity, a dual challenge often overlooked by current datasets and methodologies focusing only on verification. To address this gap, we introduce a novel dataset specifically designed for signature verification on bank checks. This dataset includes a variety of signature styles embedded within typical check elements, providing a realistic testing ground for advanced detection methods. Moreover, we propose a novel approach for writer-independent signature verification using an object detection network. Our detection-based verification method treats genuine and forged signatures as distinct classes within an object detection framework, effectively handling both detection and verification. We employ a DINO-based network augmented with a dilation module to detect and verify signatures on check images simultaneously. Our approach achieves an AP of 99.2 for genuine and 99.4 for forged signatures, a significant improvement over the DINO baseline, which scored 93.1 and 89.3 for genuine and forged signatures, respectively. This improvement highlights our dilation module's effectiveness in reducing both false positives and negatives. Our results demonstrate substantial advancements in detection-based signature verification technology, offering enhanced security and efficiency in financial document processing.
- [695] arXiv:2406.14372 [pdf, html, other]
-
Title: Ring-LWE based encrypted controller with unlimited number of recursive multiplications and effect of error growthComments: 12 pages, 3 figuresSubjects: Systems and Control (eess.SY)
In this paper, we propose a method to encrypt linear dynamic controllers that enables an unlimited number of recursive homomorphic multiplications on a Ring Learning With Errors (Ring-LWE) based cryptosystem without bootstrapping. Unlike LWE based schemes, where a scalar error is injected during encryption for security, Ring-LWE based schemes are based on polynomial rings and inject error as a polynomial having multiple error coefficients. Such errors accumulate under recursive homomorphic operations, and it has been studied that their effect can be suppressed by the closed-loop stability when dynamic controllers are encrypted using LWE based schemes. We show that this also holds for the proposed controller encrypted using a Ring-LWE based scheme. Specifically, only the constant terms of the error polynomials affect the control performance, and their effect can be arbitrarily bounded even when the noneffective terms diverge. Furthermore, a novel packing algorithm is applied, resulting in reduced computation time and enhanced memory efficiency. Simulation results demonstrate the effectiveness of the proposed method.
- [696] arXiv:2406.14373 [pdf, html, other]
-
Title: Artificial Leviathan: Exploring Social Evolution of LLM Agents Through the Lens of Hobbesian Social Contract TheoryGordon Dai, Weijia Zhang, Jinhan Li, Siqi Yang, Chidera Onochie lbe, Srihas Rao, Arthur Caetano, Misha SraSubjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC); Multiagent Systems (cs.MA)
The emergence of Large Language Models (LLMs) and advancements in Artificial Intelligence (AI) offer an opportunity for computational social science research at scale. Building upon prior explorations of LLM agent design, our work introduces a simulated agent society where complex social relationships dynamically form and evolve over time. Agents are imbued with psychological drives and placed in a sandbox survival environment. We conduct an evaluation of the agent society through the lens of Thomas Hobbes's seminal Social Contract Theory (SCT). We analyze whether, as the theory postulates, agents seek to escape a brutish "state of nature" by surrendering rights to an absolute sovereign in exchange for order and security. Our experiments unveil an alignment: Initially, agents engage in unrestrained conflict, mirroring Hobbes's depiction of the state of nature. However, as the simulation progresses, social contracts emerge, leading to the authorization of an absolute sovereign and the establishment of a peaceful commonwealth founded on mutual cooperation. This congruence between our LLM agent society's evolutionary trajectory and Hobbes's theoretical account indicates LLMs' capability to model intricate social dynamics and potentially replicate forces that shape human societies. By enabling such insights into group behavior and emergent societal phenomena, LLM-driven multi-agent simulations, while unable to simulate all the nuances of human behavior, may hold potential for advancing our understanding of social structures, group dynamics, and complex human systems.
- [697] arXiv:2406.14374 [pdf, html, other]
-
Title: Information-flow Interfaces and Security LatticesSubjects: Formal Languages and Automata Theory (cs.FL)
Information-flow interfaces is a formalism recently proposed for specifying, composing, and refining system-wide security requirements. In this work, we show how the widely used concept of security lattices provides a natural semantic interpretation for information-flow interfaces.
- [698] arXiv:2406.14377 [pdf, html, other]
-
Title: Computation-Efficient Semi-Supervised Learning for ECG-based Cardiovascular Diseases DetectionRushuang Zhou, Zijun Liu, Lei Clifton, David A. Clifton, Kannie W. Y. Chan, Yuan-Ting Zhang, Yining DongSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Label scarcity problem is the main challenge that hinders the wide application of deep learning systems in automatic cardiovascular diseases (CVDs) detection using electrocardiography (ECG). Tuning pre-trained models alleviates this problem by transferring knowledge learned from large datasets to downstream small datasets. However, bottlenecks in computational efficiency and CVDs detection performance limit its clinical applications. It is difficult to improve the detection performance without significantly sacrificing model computational efficiency. Here, we propose a computation-efficient semi-supervised learning paradigm (FastECG) for robust and computation-efficient CVDs detection using ECG. It enables a robust adaptation of pre-trained models on downstream datasets with limited supervision and high computational efficiency. First, a random-deactivation technique is developed to achieve robust and fast low-rank adaptation of pre-trained weights. Subsequently, we propose a one-shot rank allocation module to determine the optimal ranks for the update matrices of the pre-trained weights. Finally, a lightweight semi-supervised learning pipeline is introduced to enhance model performance by leveraging labeled and unlabeled data with high computational efficiency. Extensive experiments on four downstream ECG datasets demonstrate that FastECG not only outperforms the state-of-the-art methods in multi-label CVDs detection but also consumes fewer GPU footprints, training time, and parameter storage space. As such, this paradigm provides an effective solution for achieving high computational efficiency and robust detection performance in the clinical applications of pre-trained models under limited supervision.
- [699] arXiv:2406.14384 [pdf, html, other]
-
Title: Low-Step Multi-Commodity Flow EmulatorsComments: Appears at STOC 2024Subjects: Data Structures and Algorithms (cs.DS)
We introduce the concept of low-step multi-commodity flow emulators for any undirected, capacitated graph. At a high level, these emulators contain approximate multi-commodity flows whose paths contain a small number of edges, shattering the infamous flow decomposition barrier for multi-commodity flow.
We prove the existence of low-step multi-commodity flow emulators and develop efficient algorithms to compute them. We then apply them to solve constant-approximate $k$-commodity flow in $O((m+k)^{1+\epsilon})$ time. To bypass the $O(mk)$ flow decomposition barrier, we represent our output multi-commodity flow implicitly; prior to our work, even the existence of implicit constant-approximate multi-commodity flows of size $o(mk)$ was unknown.
Our results generalize to the minimum cost setting, where each edge has an associated cost and the multi-commodity flow must satisfy a cost budget. Our algorithms are also parallel. - [700] arXiv:2406.14385 [pdf, html, other]
-
Title: Semi-Autonomous Mobile Search and Rescue Robot for Radiation Disaster ScenariosSimon Schwaiger, Lucas Muster, Georg Novotny, Michael Schebek, Wilfried Wöber, Stefan Thalhammer, Christoph BöhmSubjects: Robotics (cs.RO)
This paper describes a novel semi-autonomous mobile robot system designed to assist search and rescue (SAR) first responders in disaster scenarios. While robots offer significant potential in SAR missions, current solutions are limited in their ability to handle a diverse range of tasks. This gap is addressed by presenting a system capable of (1) autonomous navigation and mapping, allowing the robot to autonomously explore and map areas affected by catastrophic events, (2) radiation mapping, enabling the system to triangulate a radiation map from discrete radiation measurements to aid in identifying hazardous areas, (3) semi-autonomous substance sampling, allowing the robot to collect samples of suspicious substances and analyze them onboard with immediate classification, and (4) valve manipulation, enabling teleoperated closing of valves that control hazardous material flow. This semi-autonomous approach balances human control over critical tasks like substance sampling with efficient robot navigation in low-risk areas. The system is evaluated during three trials that simulate possible disaster scenarios, two of which have been recorded during the European Robotics Hackathon (EnRicH). Furthermore, we provide recorded sensor data as well as the implemented software system as supplemental material through a GitHub repository: this https URL.
- [701] arXiv:2406.14388 [pdf, html, other]
-
Title: Active Diffusion SubsamplingComments: 17 pages, 12 figures, preprintSubjects: Machine Learning (cs.LG)
Subsampling is commonly used to mitigate costs associated with data acquisition, such as time or energy requirements, motivating the development of algorithms for estimating the fully-sampled signal of interest $x$ from partially observed measurements $y$. In maximum-entropy sampling, one selects measurement locations that are expected to have the highest entropy, so as to minimize uncertainty about $x$. This approach relies on an accurate model of the posterior distribution over future measurements, given the measurements observed so far. Recently, diffusion models have been shown to produce high-quality posterior samples of high-dimensional signals using guided diffusion. In this work, we propose Active Diffusion Subsampling (ADS), a method for performing active subsampling using guided diffusion in which the model tracks a distribution of beliefs over the true state of $x$ throughout the reverse diffusion process, progressively decreasing its uncertainty by choosing to acquire measurements with maximum expected entropy, and ultimately generating the posterior distribution $p(x | y)$. ADS can be applied using pre-trained diffusion models for any subsampling rate, and does not require task-specific retraining - just the specification of a measurement model. Furthermore, the maximum entropy sampling policy employed by ADS is interpretable, enhancing transparency relative to existing methods using black-box policies. Experimentally, we show that ADS outperforms fixed sampling strategies, and study an application of ADS in Magnetic Resonance Imaging acceleration using the fastMRI dataset, finding that ADS performs competitively with supervised methods. Code available at this https URL.
- [702] arXiv:2406.14391 [pdf, html, other]
-
Title: Safety-Critical Edge Robotics Architecture with Bounded End-to-End LatencyGautam Gala, Tilmann Unte, Luiz Maia, Johannes Kühbacher, Isser Kadusale, Mohammad Ibrahim Alkoudsi, Gerhard Fohler, Sebastian AltmeyerJournal-ref: 3rd International Workshop on Real-Time Cloud systems (RT-Cloud), held in conjunction with the 36th Euromicro Conference on Real-time Systems (ECRTS) 2024Subjects: Robotics (cs.RO); Distributed, Parallel, and Cluster Computing (cs.DC); Emerging Technologies (cs.ET)
Edge computing processes data near its source, reducing latency and enhancing security compared to traditional cloud computing while providing its benefits. This paper explores edge computing for migrating an existing safety-critical robotics use case from an onboard dedicated hardware solution. We propose an edge robotics architecture based on Linux, Docker containers, Kubernetes, and a local wireless area network based on the TTWiFi protocol. Inspired by previous work on real-time cloud, we complement the architecture with a resource management and orchestration layer to help Linux manage, and Kubernetes orchestrate the system-wide shared resources (e.g., caches, memory bandwidth, and network). Our architecture aims to ensure the fault-tolerant and predictable execution of robotic applications (e.g., path planning) on the edge while upper-bounding the end-to-end latency and ensuring the best possible quality of service without jeopardizing safety and security.
- [703] arXiv:2406.14393 [pdf, html, other]
-
Title: Jailbreaking as a Reward Misspecification ProblemSubjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
The widespread adoption of large language models (LLMs) has raised concerns about their safety and reliability, particularly regarding their vulnerability to adversarial attacks. In this paper, we propose a novel perspective that attributes this vulnerability to reward misspecification during the alignment process. We introduce a metric ReGap to quantify the extent of reward misspecification and demonstrate its effectiveness and robustness in detecting harmful backdoor prompts. Building upon these insights, we present ReMiss, a system for automated red teaming that generates adversarial prompts against various target aligned LLMs. ReMiss achieves state-of-the-art attack success rates on the AdvBench benchmark while preserving the human readability of the generated prompts. Detailed analysis highlights the unique advantages brought by the proposed reward misspecification objective compared to previous methods.
- [704] arXiv:2406.14394 [pdf, html, other]
-
Title: SEC-QA: A Systematic Evaluation Corpus for Financial QASubjects: Computation and Language (cs.CL)
The financial domain frequently deals with large numbers of long documents that are essential for daily operations. Significant effort is put towards automating financial data analysis. However, a persistent challenge, not limited to the finance domain, is the scarcity of datasets that accurately reflect real-world tasks for model evaluation. Existing datasets are often constrained by size, context, or relevance to practical applications. Moreover, LLMs are currently trained on trillions of tokens of text, limiting access to novel data or documents that models have not encountered during training for unbiased evaluation. We propose SEC-QA, a continuous dataset generation framework with two key features: 1) the semi-automatic generation of Question-Answer (QA) pairs spanning multiple long context financial documents, which better represent real-world financial scenarios; 2) the ability to continually refresh the dataset using the most recent public document collections, not yet ingested by LLMs. Our experiments show that current retrieval augmented generation methods systematically fail to answer these challenging multi-document questions. In response, we introduce a QA system based on program-of-thought that improves the ability to perform complex information retrieval and quantitative reasoning pipelines, thereby increasing QA accuracy.
- [705] arXiv:2406.14397 [pdf, html, other]
-
Title: Jupyter Scatter: Interactive Exploration of Large-Scale DatasetsSubjects: Human-Computer Interaction (cs.HC)
Jupyter Scatter is a scalable, interactive, and interlinked scatterplot widget for exploring datasets in Jupyter Notebook/Lab, Colab, and VS Code. Its goal is to simplify the visual exploration, analysis, and comparison of large-scale bivariate datasets. Jupyter Scatter can render up to twenty million points, supports fast point selections, integrates with Pandas DataFrame and Matplotlib, uses perceptually-effective default settings, and offers a user-friendly API.
- [706] arXiv:2406.14398 [pdf, html, other]
-
Title: ATAC-Net: Zoomed view works better for Anomaly DetectionSubjects: Computer Vision and Pattern Recognition (cs.CV)
The application of deep learning in visual anomaly detection has gained widespread popularity due to its potential use in quality control and manufacturing. Current standard methods are Unsupervised, where a clean dataset is utilised to detect deviations and flag anomalies during testing. However, incorporating a few samples when the type of anomalies is known beforehand can significantly enhance performance. Thus, we propose ATAC-Net, a framework that trains to detect anomalies from a minimal set of known prior anomalies. Furthermore, we introduce attention-guided cropping, which provides a closer view of suspect regions during the training phase. Our framework is a reliable and easy-to-understand system for detecting anomalies, and we substantiate its superiority to some of the current state-of-the-art techniques in a comparable setting.
- [707] arXiv:2406.14399 [pdf, html, other]
-
Title: WEATHER-5K: A Large-scale Global Station Weather Dataset Towards Comprehensive Time-series Forecasting BenchmarkComments: 26 pages,13 figuresSubjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Atmospheric and Oceanic Physics (physics.ao-ph); Machine Learning (stat.ML)
Global Station Weather Forecasting (GSWF) is crucial for various sectors, including aviation, agriculture, energy, and disaster preparedness. Recent advancements in deep learning have significantly improved the accuracy of weather predictions by optimizing models based on public meteorological data. However, existing public datasets for GSWF optimization and benchmarking still suffer from significant limitations, such as small sizes, limited temporal coverage, and a lack of comprehensive variables. These shortcomings prevent them from effectively reflecting the benchmarks of current forecasting methods and fail to support the real needs of operational weather forecasting. To address these challenges, we present the WEATHER-5K dataset. This dataset comprises a comprehensive collection of data from 5,672 weather stations worldwide, spanning a 10-year period with one-hour intervals. It includes multiple crucial weather elements, providing a more reliable and interpretable resource for forecasting. Furthermore, our WEATHER-5K dataset can serve as a benchmark for comprehensively evaluating existing well-known forecasting models, extending beyond GSWF methods to support future time-series research challenges and opportunities. The dataset and benchmark implementation are publicly available at: this https URL.
- [708] arXiv:2406.14401 [pdf, html, other]
-
Title: Fair Streaming Feature SelectionComments: 30 pages, 10 figuresSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Streaming feature selection techniques have become essential in processing real-time data streams, as they facilitate the identification of the most relevant attributes from continuously updating information. Despite their performance, current algorithms to streaming feature selection frequently fall short in managing biases and avoiding discrimination that could be perpetuated by sensitive attributes, potentially leading to unfair outcomes in the resulting models. To address this issue, we propose FairSFS, a novel algorithm for Fair Streaming Feature Selection, to uphold fairness in the feature selection process without compromising the ability to handle data in an online manner. FairSFS adapts to incoming feature vectors by dynamically adjusting the feature set and discerns the correlations between classification attributes and sensitive attributes from this revised set, thereby forestalling the propagation of sensitive data. Empirical evaluations show that FairSFS not only maintains accuracy that is on par with leading streaming feature selection methods and existing fair feature techniques but also significantly improves fairness metrics.
- [709] arXiv:2406.14402 [pdf, html, other]
-
Title: Logic-based analogical proportionsSubjects: Logic in Computer Science (cs.LO); Discrete Mathematics (cs.DM); Logic (math.LO)
The author has recently introduced an abstract algebraic framework of analogical proportions within the general setting of universal algebra. The purpose of this paper is to lift that framework from universal algebra to the strictly more expressive setting of full first-order logic. We show that the so-obtained logic-based framework preserves all desired properties and we prove novel results in that extended setting.
- [710] arXiv:2406.14404 [pdf, html, other]
-
Title: Predicting Probabilities of Error to Combine Quantization and Early Exiting: QuEEFlorence Regol, Joud Chataoui, Bertrand Charpentier, Mark Coates, Pablo Piantanida, Stephan GunnemannSubjects: Machine Learning (cs.LG)
Machine learning models can solve complex tasks but often require significant computational resources during inference. This has led to the development of various post-training computation reduction methods that tackle this issue in different ways, such as quantization which reduces the precision of weights and arithmetic operations, and dynamic networks which adapt computation to the sample at hand. In this work, we propose a more general dynamic network that can combine both quantization and early exit dynamic network: QuEE. Our algorithm can be seen as a form of soft early exiting or input-dependent compression. Rather than a binary decision between exiting or continuing, we introduce the possibility of continuing with reduced computation. This complicates the traditionally considered early exiting problem, which we solve through a principled formulation. The crucial factor of our approach is accurate prediction of the potential accuracy improvement achievable through further computation. We demonstrate the effectiveness of our method through empirical evaluation, as well as exploring the conditions for its success on 4 classification datasets.
- [711] arXiv:2406.14405 [pdf, html, other]
-
Title: Constrained $L^p$ Approximation of Shape Tensors and its Role for the Determination of Shape GradientsSubjects: Numerical Analysis (math.NA); Optimization and Control (math.OC)
This paper extends our earlier work [arXiv:2309.13595] on the $L^p$ approximation of the shape tensor by Laurain and Sturm. In particular, it is shown that the weighted $L^p$ distance to an affine space of admissible symmetric shape tensors satisfying a divergence constraint provides the shape gradient with respect to the $L^{p^\ast}$-norm (where $1/p + 1/p^\ast = 1$) of the elastic strain associated with the shape deformation. This approach allows the combination of two ingredients which have already been used successfully in numerical shape optimization: (i) departing from the Hilbert space framework towards the Lipschitz topology approximated by $W^{1,p^\ast}$ with $p^\ast > 2$ and (ii) using the symmetric rather than the full gradient to define the norm. Similarly to [arXiv:2309.13595], the $L^p$ distance measures the shape stationarity by means of the dual norm of the shape derivative with respect to the above-mentioned $L^{p^\ast}$-norm of the elastic strain. Moreover, the Lagrange multiplier for the momentum balance constraint constitute the steepest descent deformation with respect to this norm. The finite element realization of this approach is done using the weakly symmetric PEERS element and its three-dimensional counterpart, respectively. The resulting piecewise constant approximation for the Lagrange multiplier is reconstructed to a shape gradient in $W^{1,p^\ast}$ and used in an iterative procedure towards the optimal shape.
- [712] arXiv:2406.14408 [pdf, html, other]
-
Title: FVEL: Interactive Formal Verification Environment with Large Language Models via Theorem ProvingXiaohan Lin, Qingxing Cao, Yinya Huang, Haiming Wang, Jianqiao Lu, Zhengying Liu, Linqi Song, Xiaodan LiangSubjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Mathematical Software (cs.MS)
Formal verification (FV) has witnessed growing significance with current emerging program synthesis by the evolving large language models (LLMs). However, current formal verification mainly resorts to symbolic verifiers or hand-craft rules, resulting in limitations for extensive and flexible verification. On the other hand, formal languages for automated theorem proving, such as Isabelle, as another line of rigorous verification, are maintained with comprehensive rules and theorems. In this paper, we propose FVEL, an interactive Formal Verification Environment with LLMs. Specifically, FVEL transforms a given code to be verified into Isabelle, and then conducts verification via neural automated theorem proving with an LLM. The joined paradigm leverages the rigorous yet abundant formulated and organized rules in Isabelle and is also convenient for introducing and adjusting cutting-edge LLMs. To achieve this goal, we extract a large-scale FVELER3. The FVELER dataset includes code dependencies and verification processes that are formulated in Isabelle, containing 758 theories, 29,125 lemmas, and 200,646 proof steps in total with in-depth dependencies. We benchmark FVELER in the FVEL environment by first fine-tuning LLMs with FVELER and then evaluating them on Code2Inv and SV-COMP. The results show that FVEL with FVELER fine-tuned Llama3- 8B solves 17.39% (69 -> 81) more problems, and Mistral-7B 12% (75 -> 84) more problems in SV-COMP. And the proportion of proof errors is reduced. Project page: this https URL.
- [713] arXiv:2406.14412 [pdf, html, other]
-
Title: Benchmarking Monocular 3D Dog Pose Estimation Using In-The-Wild Motion Capture DataComments: 5 pages, 8 figures, including supplementary, CV4Animals Workshop 2024 (CVPRW)Subjects: Computer Vision and Pattern Recognition (cs.CV)
We introduce a new benchmark analysis focusing on 3D canine pose estimation from monocular in-the-wild images. A multi-modal dataset 3DDogs-Lab was captured indoors, featuring various dog breeds trotting on a walkway. It includes data from optical marker-based mocap systems, RGBD cameras, IMUs, and a pressure mat. While providing high-quality motion data, the presence of optical markers and limited background diversity make the captured video less representative of real-world conditions. To address this, we created 3DDogs-Wild, a naturalised version of the dataset where the optical markers are in-painted and the subjects are placed in diverse environments, enhancing its utility for training RGB image-based pose detectors. We show that using the 3DDogs-Wild to train the models leads to improved performance when evaluating on in-the-wild data. Additionally, we provide a thorough analysis using various pose estimation models, revealing their respective strengths and weaknesses. We believe that our findings, coupled with the datasets provided, offer valuable insights for advancing 3D animal pose estimation.
- [714] arXiv:2406.14415 [pdf, html, other]
-
Title: Vectorized Representation Dreamer (VRD): Dreaming-Assisted Multi-Agent Motion-ForecastingComments: Accepted for publication in IEEE Intelligent Vehicle Symposium (IV 2024)Subjects: Robotics (cs.RO); Machine Learning (cs.LG)
For an autonomous vehicle to plan a path in its environment, it must be able to accurately forecast the trajectory of all dynamic objects in its proximity. While many traditional methods encode observations in the scene to solve this problem, there are few approaches that consider the effect of the ego vehicle's behavior on the future state of the world. In this paper, we introduce VRD, a vectorized world model-inspired approach to the multi-agent motion forecasting problem. Our method combines a traditional open-loop training regime with a novel dreamed closed-loop training pipeline that leverages a kinematic reconstruction task to imagine the trajectory of all agents, conditioned on the action of the ego vehicle. Quantitative and qualitative experiments are conducted on the Argoverse 2 multi-world forecasting evaluation dataset and the intersection drone (inD) dataset to demonstrate the performance of our proposed model. Our model achieves state-of-the-art performance on the single prediction miss rate metric on the Argoverse 2 dataset and performs on par with the leading models for the single prediction displacement metrics.
- [715] arXiv:2406.14418 [pdf, html, other]
-
Title: Worst-Case Learning under a Multi-fidelity ModelSubjects: Numerical Analysis (math.NA)
Inspired by multi-fidelity methods in computer simulations, this article introduces procedures to design surrogates for the input/output relationship of a high-fidelity code. These surrogates should be learned from runs of both the high-fidelity and low-fidelity codes and be accompanied by error guarantees that are deterministic rather than stochastic. For this purpose, the article advocates a framework tied to a theory focusing on worst-case guarantees, namely Optimal Recovery. The multi-fidelity considerations triggered new theoretical results in three scenarios: the globally optimal estimation of linear functionals, the globally optimal approximation of arbitrary quantities of interest in Hilbert spaces, and their locally optimal approximation, still within Hilbert spaces. The latter scenario boils down to the determination of the Chebyshev center for the intersection of two hyperellipsoids. It is worth noting that the mathematical framework presented here, together with its possible extension, seems to be relevant in several other contexts briefly discussed.
- [716] arXiv:2406.14420 [pdf, html, other]
-
Title: Communication-efficient Vertical Federated Learning via Compressed Error FeedbackSubjects: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC); Data Structures and Algorithms (cs.DS); Optimization and Control (math.OC)
Communication overhead is a known bottleneck in federated learning (FL). To address this, lossy compression is commonly used on the information communicated between the server and clients during training. In horizontal FL, where each client holds a subset of the samples, such communication-compressed training methods have recently seen significant progress. However, in their vertical FL counterparts, where each client holds a subset of the features, our understanding remains limited. To address this, we propose an error feedback compressed vertical federated learning (EFVFL) method to train split neural networks. In contrast with previous communication-compressed methods for vertical FL, EFVFL does not require a vanishing compression error for the gradient norm to converge to zero for smooth nonconvex problems. By leveraging error feedback, our method can achieve a $\mathcal{O}(1/T)$ convergence rate in the full-batch case, improving over the state-of-the-art $\mathcal{O}(1/\sqrt{T})$ rate under $\mathcal{O}(1/\sqrt{T})$ compression error, and matching the rate of uncompressed methods. Further, when the objective function satisfies the Polyak-Łojasiewicz inequality, our method converges linearly. In addition to improving convergence rates, our method also supports the use of private labels. Numerical experiments show that EFVFL significantly improves over the prior art, confirming our theoretical results.
- [717] arXiv:2406.14422 [pdf, html, other]
-
Title: FutureNet-LOF: Joint Trajectory Prediction and Lane Occupancy Field Prediction with Future Context EncodingMingkun Wang, Xiaoguang Ren, Ruochun Jin, Minglong Li, Xiaochuan Zhang, Changqian Yu, Mingxu Wang, Wenjing YangComments: 10 pagesSubjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Most prior motion prediction endeavors in autonomous driving have inadequately encoded future scenarios, leading to predictions that may fail to accurately capture the diverse movements of agents (e.g., vehicles or pedestrians). To address this, we propose FutureNet, which explicitly integrates initially predicted trajectories into the future scenario and further encodes these future contexts to enhance subsequent forecasting. Additionally, most previous motion forecasting works have focused on predicting independent futures for each agent. However, safe and smooth autonomous driving requires accurately predicting the diverse future behaviors of numerous surrounding agents jointly in complex dynamic environments. Given that all agents occupy certain potential travel spaces and possess lane driving priority, we propose Lane Occupancy Field (LOF), a new representation with lane semantics for motion forecasting in autonomous driving. LOF can simultaneously capture the joint probability distribution of all road participants' future spatial-temporal positions. Due to the high compatibility between lane occupancy field prediction and trajectory prediction, we propose a novel network with future context encoding for the joint prediction of these two tasks. Our approach ranks 1st on two large-scale motion forecasting benchmarks: Argoverse 1 and Argoverse 2.
- [718] arXiv:2406.14424 [pdf, html, other]
-
Title: CascadeServe: Unlocking Model Cascades for Inference ServingComments: 17 pages, 13 figuresSubjects: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
Machine learning (ML) models are increasingly deployed to production, calling for efficient inference serving systems. Efficient inference serving is complicated by two challenges: (i) ML models incur high computational costs, and (ii) the request arrival rates of practical applications have frequent, high, and sudden variations which make it hard to correctly provision hardware. Model cascades are positioned to tackle both of these challenges, as they (i) save work while maintaining accuracy, and (ii) expose a high-resolution trade-off between work and accuracy, allowing for fine-grained adjustments to request arrival rates. Despite their potential, model cascades haven't been used inside an online serving system. This comes with its own set of challenges, including workload adaption, model replication onto hardware, inference scheduling, request batching, and more. In this work, we propose CascadeServe, which automates and optimizes end-to-end inference serving with cascades. CascadeServe operates in an offline and online phase. In the offline phase, the system pre-computes a gear plan that specifies how to serve inferences online. In the online phase, the gear plan allows the system to serve inferences while making near-optimal adaptations to the query load at negligible decision overheads. We find that CascadeServe saves 2-3x in cost across a wide spectrum of the latency-accuracy space when compared to state-of-the-art baselines on different workloads.
- [719] arXiv:2406.14425 [pdf, html, other]
-
Title: SynDARin: Synthesising Datasets for Automated Reasoning in Low-Resource LanguagesSubjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Question Answering (QA) datasets have been instrumental in developing and evaluating Large Language Model (LLM) capabilities. However, such datasets are scarce for languages other than English due to the cost and difficulties of collection and manual annotation. This means that producing novel models and measuring the performance of multilingual LLMs in low-resource languages is challenging. To mitigate this, we propose $\textbf{S}$yn$\textbf{DAR}$in, a method for generating and validating QA datasets for low-resource languages. We utilize parallel content mining to obtain $\textit{human-curated}$ paragraphs between English and the target language. We use the English data as context to $\textit{generate}$ synthetic multiple-choice (MC) question-answer pairs, which are automatically translated and further validated for quality. Combining these with their designated non-English $\textit{human-curated}$ paragraphs form the final QA dataset. The method allows to maintain the content quality, reduces the likelihood of factual errors, and circumvents the need for costly annotation. To test the method, we created a QA dataset with $1.2$K samples for the Armenian language. The human evaluation shows that $98\%$ of the generated English data maintains quality and diversity in the question types and topics, while the translation validation pipeline can filter out $\sim70\%$ of data with poor quality. We use the dataset to benchmark state-of-the-art LLMs, showing their inability to achieve human accuracy with some model performances closer to random chance. This shows that the generated dataset is non-trivial and can be used to evaluate reasoning capabilities in low-resource language.
- [720] arXiv:2406.14427 [pdf, html, other]
-
Title: Control when confidence is costlyComments: 9 pages, 4 figures, submitted to NeurIPS 2024Subjects: Artificial Intelligence (cs.AI); Neurons and Cognition (q-bio.NC)
We develop a version of stochastic control that accounts for computational costs of inference. Past studies identified efficient coding without control, or efficient control that neglects the cost of synthesizing information. Here we combine these concepts into a framework where agents rationally approximate inference for efficient control. Specifically, we study Linear Quadratic Gaussian (LQG) control with an added internal cost on the relative precision of the posterior probability over the world state. This creates a trade-off: an agent can obtain more utility overall by sacrificing some task performance, if doing so saves enough bits during inference. We discover that the rational strategy that solves the joint inference and control problem goes through phase transitions depending on the task demands, switching from a costly but optimal inference to a family of suboptimal inferences related by rotation transformations, each misestimate the stability of the world. In all cases, the agent moves more to think less. This work provides a foundation for a new type of rational computations that could be used by both brains and machines for efficient but computationally constrained control.
- [721] arXiv:2406.14429 [pdf, html, other]
-
Title: CollaFuse: Collaborative Diffusion ModelsComments: 13 pages, 7 figuresSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
In the landscape of generative artificial intelligence, diffusion-based models have emerged as a promising method for generating synthetic images. However, the application of diffusion models poses numerous challenges, particularly concerning data availability, computational requirements, and privacy. Traditional approaches to address these shortcomings, like federated learning, often impose significant computational burdens on individual clients, especially those with constrained resources. In response to these challenges, we introduce a novel approach for distributed collaborative diffusion models inspired by split learning. Our approach facilitates collaborative training of diffusion models while alleviating client computational burdens during image synthesis. This reduced computational burden is achieved by retaining data and computationally inexpensive processes locally at each client while outsourcing the computationally expensive processes to shared, more efficient server resources. Through experiments on the common CelebA dataset, our approach demonstrates enhanced privacy by reducing the necessity for sharing raw data. These capabilities hold significant potential across various application areas, including the design of edge computing solutions. Thus, our work advances distributed machine learning by contributing to the evolution of collaborative diffusion models.
- [722] arXiv:2406.14430 [pdf, html, other]
-
Title: Adaptive Deep Neural Network-Based Control Barrier FunctionsComments: 7 pages, 2 figures, 28 referencesSubjects: Systems and Control (eess.SY)
Safety constraints of nonlinear control systems are commonly enforced through the use of control barrier functions (CBFs). Uncertainties in the dynamic model can disrupt forward invariance guarantees or cause the state to be restricted to an overly conservative subset of the safe set. In this paper, adaptive deep neural networks (DNNs) are combined with CBFs to produce a family of controllers that ensure safety while learning the system's dynamics in real-time without the requirement for pre-training. By basing the least squares adaptation law on a state derivative estimator-based identification error, the DNN parameter estimation error is shown to be uniformly ultimately bounded. The convergent bound on the parameter estimation error is then used to formulate CBF-constraints in an optimization-based controller to guarantee safety despite model uncertainty. Furthermore, the developed method is applicable for use under intermittent loss of state-feedback. Comparative simulation results demonstrate the ability of the developed method to ensure safety in an adaptive cruise control problem and when feedback is lost, unlike baseline methods.
- [723] arXiv:2406.14434 [pdf, html, other]
-
Title: Towards Truthful Multilingual Large Language Models: Benchmarking and Alignment StrategiesComments: 15 pagesSubjects: Computation and Language (cs.CL)
In the era of large language models (LLMs), building multilingual large language models (MLLMs) that can serve users worldwide holds great significance. However, existing research seldom focuses on the truthfulness of MLLMs. Meanwhile, contemporary multilingual aligning technologies struggle to balance massive languages and often exhibit serious truthfulness gaps across different languages, especially those that differ greatly from English. In our work, we construct a benchmark for truthfulness evaluation in multilingual scenarios and explore the ways to align facts across languages to enhance the truthfulness of MLLMs. Furthermore, we propose Fact-aware Multilingual Selective Synergy (FaMSS) to optimize the data allocation across a large number of languages and different data types. Experimental results demonstrate that our approach can effectively reduce the multilingual representation disparity and enhance the multilingual capabilities of LLMs.
- [724] arXiv:2406.14436 [pdf, html, other]
-
Title: Video Generation with Learned Action PriorSubjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Stochastic video generation is particularly challenging when the camera is mounted on a moving platform, as camera motion interacts with observed image pixels, creating complex spatio-temporal dynamics and making the problem partially observable. Existing methods typically address this by focusing on raw pixel-level image reconstruction without explicitly modelling camera motion dynamics. We propose a solution by considering camera motion or action as part of the observed image state, modelling both image and action within a multi-modal learning framework. We introduce three models: Video Generation with Learning Action Prior (VG-LeAP) treats the image-action pair as an augmented state generated from a single latent stochastic process and uses variational inference to learn the image-action latent prior; Causal-LeAP, which establishes a causal relationship between action and the observed image frame at time $t$, learning an action prior conditioned on the observed image states; and RAFI, which integrates the augmented image-action state concept into flow matching with diffusion generative processes, demonstrating that this action-conditioned image generation concept can be extended to other diffusion-based models. We emphasize the importance of multi-modal training in partially observable video generation problems through detailed empirical studies on our new video action dataset, RoAM.
- [725] arXiv:2406.14441 [pdf, html, other]
-
Title: Vahana.jl -- A framework (not only) for large-scale agent-based modelsSubjects: Multiagent Systems (cs.MA); Distributed, Parallel, and Cluster Computing (cs.DC)
Agent-based models (ABMs) offer a powerful framework for understanding complex systems. However, their computational demands often become a significant barrier as the number of agents and complexity of the simulation increase. Traditional ABM platforms often struggle to fully exploit modern computing resources, hindering the development of large-scale simulations. This paper presents Vahana.jl, a high performance computing open source framework that aims to address these limitations. Building on the formalism of synchronous graph dynamical systems, Vahana.jl is especially well suited for models with a focus on (social) networks. The framework seamlessly supports distribution across multiple compute nodes, enabling simulations that would otherwise be beyond the capabilities of a single machine. Implemented in Julia, Vahana.jl leverages the interactive Read-Eval-Print Loop (REPL) environment, facilitating rapid model development and experimentation.
- [726] arXiv:2406.14442 [pdf, html, other]
-
Title: Graph Representation Learning Strategies for Omics Data: A Case Study on Parkinson's DiseaseElisa Gómez de Lope (1), Saurabh Deshpande (1), Ramón Viñas Torné (2), Pietro Liò (3), Enrico Glaab (1 and 4), Stéphane P. A. Bordas (1) ((1) University of Luxembourg, (2) École polytechnique fédérale de Lausanne (EPFL), (3) University of Cambridge, (4) On behalf of the NCER-PD Consortium)Comments: Submitted to Machine Learning in Computational Biology 2024 as an extended abstract, 2 pages + 1 appendixSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Biomolecules (q-bio.BM); Molecular Networks (q-bio.MN)
Omics data analysis is crucial for studying complex diseases, but its high dimensionality and heterogeneity challenge classical statistical and machine learning methods. Graph neural networks have emerged as promising alternatives, yet the optimal strategies for their design and optimization in real-world biomedical challenges remain unclear. This study evaluates various graph representation learning models for case-control classification using high-throughput biological data from Parkinson's disease and control samples. We compare topologies derived from sample similarity networks and molecular interaction networks, including protein-protein and metabolite-metabolite interactions (PPI, MMI). Graph Convolutional Network (GCNs), Chebyshev spectral graph convolution (ChebyNet), and Graph Attention Network (GAT), are evaluated alongside advanced architectures like graph transformers, the graph U-net, and simpler models like multilayer perceptron (MLP).
These models are systematically applied to transcriptomics and metabolomics data independently. Our comparative analysis highlights the benefits and limitations of various architectures in extracting patterns from omics data, paving the way for more accurate and interpretable models in biomedical research. - [727] arXiv:2406.14446 [pdf, html, other]
-
Title: Maintenance Required: Updating and Extending Bootstrapped Human Activity Recognition Systems for Smart HomesComments: 12 pages, 5 figures, accepted at The 6th International Conference on Activity and Behavior Computing, under print at IEEE ExploreSubjects: Machine Learning (cs.LG)
Developing human activity recognition (HAR) systems for smart homes is not straightforward due to varied layouts of the homes and their personalized settings, as well as idiosyncratic behaviors of residents. As such, off-the-shelf HAR systems are effective in limited capacity for an individual home, and HAR systems often need to be derived "from scratch", which comes with substantial efforts and often is burdensome to the resident. Previous work has successfully targeted the initial phase. At the end of this initial phase, we identify seed points. We build on bootstrapped HAR systems and introduce an effective updating and extension procedure for continuous improvement of HAR systems with the aim of keeping up with ever changing life circumstances. Our method makes use of the seed points identified at the end of the initial bootstrapping phase. A contrastive learning framework is trained using these seed points and labels obtained for the same. This model is then used to improve the segmentation accuracy of the identified prominent activities. Improvements in the activity recognition system through this procedure help model the majority of the routine activities in the smart home. We demonstrate the effectiveness of our procedure through experiments on the CASAS datasets that show the practical value of our approach.
- [728] arXiv:2406.14449 [pdf, html, other]
-
Title: APEER: Automatic Prompt Engineering Enhances Large Language Model RerankingCan Jin, Hongwu Peng, Shiyu Zhao, Zhenting Wang, Wujiang Xu, Ligong Han, Jiahui Zhao, Kai Zhong, Sanguthevar Rajasekaran, Dimitris N. MetaxasSubjects: Artificial Intelligence (cs.AI)
Large Language Models (LLMs) have significantly enhanced Information Retrieval (IR) across various modules, such as reranking. Despite impressive performance, current zero-shot relevance ranking with LLMs heavily relies on human prompt engineering. Existing automatic prompt engineering algorithms primarily focus on language modeling and classification tasks, leaving the domain of IR, particularly reranking, underexplored. Directly applying current prompt engineering algorithms to relevance ranking is challenging due to the integration of query and long passage pairs in the input, where the ranking complexity surpasses classification tasks. To reduce human effort and unlock the potential of prompt optimization in reranking, we introduce a novel automatic prompt engineering algorithm named APEER. APEER iteratively generates refined prompts through feedback and preference optimization. Extensive experiments with four LLMs and ten datasets demonstrate the substantial performance improvement of APEER over existing state-of-the-art (SoTA) manual prompts. Furthermore, we find that the prompts generated by APEER exhibit better transferability across diverse tasks and LLMs. Code is available at this https URL.
- [729] arXiv:2406.14452 [pdf, html, other]
-
Title: Science in a Blink: Supporting Ensemble Perception in Scalar FieldsComments: To appear in Proceedings of the 2024 IEEE Visualization Conference (VIS'24)Subjects: Human-Computer Interaction (cs.HC)
Visualizations support rapid analysis of scientific datasets, allowing viewers to glean aggregate information (e.g., the mean) within split-seconds. While prior research has explored this ability in conventional charts, it is unclear if spatial visualizations used by computational scientists afford a similar ensemble perception capacity. We investigate people's ability to estimate two summary statistics, mean and variance, from pseudocolor scalar fields. In a crowdsourced experiment, we find that participants can reliably characterize both statistics, although variance discrimination requires a much stronger signal. Multi-hue and diverging colormaps outperformed monochromatic, luminance ramps in aiding this extraction. Analysis of qualitative responses suggests that participants often estimate the distribution of hotspots and valleys as visual proxies for data statistics. These findings suggest that people's summary interpretation of spatial datasets is likely driven by the appearance of discrete color segments, rather than assessments of overall luminance. Implicit color segmentation in quantitative displays could thus prove more useful than previously assumed by facilitating quick, gist-level judgments about color-coded visualizations.
- [730] arXiv:2406.14455 [pdf, html, other]
-
Title: MM-GTUNets: Unified Multi-Modal Graph Deep Learning for Brain Disorders PredictionLuhui Cai, Weiming Zeng, Hongyu Chen, Hua Zhang, Yueyang Li, Hongjie Yan, Lingbin Bian, Nizhuan WangSubjects: Computer Vision and Pattern Recognition (cs.CV)
Graph deep learning (GDL) has demonstrated impressive performance in predicting population-based brain disorders (BDs) through the integration of both imaging and non-imaging data. However, the effectiveness of GDL based methods heavily depends on the quality of modeling the multi-modal population graphs and tends to degrade as the graph scale increases. Furthermore, these methods often constrain interactions between imaging and non-imaging data to node-edge interactions within the graph, overlooking complex inter-modal correlations, leading to suboptimal outcomes. To overcome these challenges, we propose MM-GTUNets, an end-to-end graph transformer based multi-modal graph deep learning (MMGDL) framework designed for brain disorders prediction at large scale. Specifically, to effectively leverage rich multi-modal information related to diseases, we introduce Modality Reward Representation Learning (MRRL) which adaptively constructs population graphs using a reward system. Additionally, we employ variational autoencoder to reconstruct latent representations of non-imaging features aligned with imaging features. Based on this, we propose Adaptive Cross-Modal Graph Learning (ACMGL), which captures critical modality-specific and modality-shared features through a unified GTUNet encoder taking advantages of Graph UNet and Graph Transformer, and feature fusion module. We validated our method on two public multi-modal datasets ABIDE and ADHD-200, demonstrating its superior performance in diagnosing BDs. Our code is available at this https URL.
- [731] arXiv:2406.14456 [pdf, html, other]
-
Title: Capturing Temporal Components for Time Series ClassificationComments: 15 pages, 2 figures, 4 tablesSubjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Analyzing sequential data is crucial in many domains, particularly due to the abundance of data collected from the Internet of Things paradigm. Time series classification, the task of categorizing sequential data, has gained prominence, with machine learning approaches demonstrating remarkable performance on public benchmark datasets. However, progress has primarily been in designing architectures for learning representations from raw data at fixed (or ideal) time scales, which can fail to generalize to longer sequences. This work introduces a \textit{compositional representation learning} approach trained on statistically coherent components extracted from sequential data. Based on a multi-scale change space, an unsupervised approach is proposed to segment the sequential data into chunks with similar statistical properties. A sequence-based encoder model is trained in a multi-task setting to learn compositional representations from these temporal components for time series classification. We demonstrate its effectiveness through extensive experiments on publicly available time series classification benchmarks. Evaluating the coherence of segmented components shows its competitive performance on the unsupervised segmentation task.
- [732] arXiv:2406.14457 [pdf, html, other]
-
Title: Rewarding What Matters: Step-by-Step Reinforcement Learning for Task-Oriented DialogueSubjects: Artificial Intelligence (cs.AI)
Reinforcement learning (RL) is a powerful approach to enhance task-oriented dialogue (TOD) systems. However, existing RL methods tend to mainly focus on generation tasks, such as dialogue policy learning (DPL) or response generation (RG), while neglecting dialogue state tracking (DST) for understanding. This narrow focus limits the systems to achieve globally optimal performance by overlooking the interdependence between understanding and generation. Additionally, RL methods face challenges with sparse and delayed rewards, which complicates training and optimization. To address these issues, we extend RL into both understanding and generation tasks by introducing step-by-step rewards throughout the token generation. The understanding reward increases as more slots are correctly filled in DST, while the generation reward grows with the accurate inclusion of user requests. Our approach provides a balanced optimization aligned with task completion. Experimental results demonstrate that our approach effectively enhances the performance of TOD systems and achieves new state-of-the-art results on three widely used datasets, including MultiWOZ2.0, MultiWOZ2.1, and In-Car. Our approach also shows superior few-shot ability in low-resource settings compared to current models.
- [733] arXiv:2406.14458 [pdf, html, other]
-
Title: Centimeter Positioning Accuracy using AI/ML for 6G ApplicationsComments: 2 Pages, 2 Figures, ICMLCN Conference, Stockholm, SwedenSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Theory (cs.IT); Signal Processing (eess.SP)
This research looks at using AI/ML to achieve centimeter-level user positioning in 6G applications such as the Industrial Internet of Things (IIoT). Initial results show that our AI/ML-based method can estimate user positions with an accuracy of 17 cm in an indoor factory environment. In this proposal, we highlight our approaches and future directions.
- [734] arXiv:2406.14459 [pdf, html, other]
-
Title: Healing Powers of BERT: How Task-Specific Fine-Tuning Recovers Corrupted Language ModelsSubjects: Computation and Language (cs.CL)
Language models like BERT excel at sentence classification tasks due to extensive pre-training on general data, but their robustness to parameter corruption is unexplored. To understand this better, we look at what happens if a language model is "broken", in the sense that some of its parameters are corrupted and then recovered by fine-tuning. Strategically corrupting BERT variants at different levels, we find corrupted models struggle to fully recover their original performance, with higher corruption causing more severe degradation. Notably, bottom-layer corruption affecting fundamental linguistic features is more detrimental than top-layer corruption. Our insights contribute to understanding language model robustness and adaptability under adverse conditions, informing strategies for developing resilient NLP systems against parameter perturbations.
- [735] arXiv:2406.14460 [pdf, html, other]
-
Title: Podcast Outcasts: Understanding Rumble's Podcast DynamicsSubjects: Social and Information Networks (cs.SI)
Podcasting on Rumble, an alternative video-sharing platform, attracts controversial figures known for spreading divisive and often misleading content, which sharply contrasts with YouTube's more regulated environment. Motivated by the growing impact of podcasts on political discourse, as seen with figures like Joe Rogan and Andrew Tate, this paper explores the political biases and content strategies used by these platforms. In this paper, we conduct a comprehensive analysis of over 13K podcast videos from both YouTube and Rumble, focusing on their political content and the dynamics of their audiences. Using advanced speech-to-text transcription, topic modeling, and contrastive learning techniques, we explore three critical aspects: the presence of political bias in podcast channels, the nature of content that drives podcast views, and the usage of visual elements in these podcasts. Our findings reveal a distinct right-wing orientation in Rumble's podcasts, contrasting with YouTube's more diverse and apolitical content.
- [736] arXiv:2406.14462 [pdf, html, other]
-
Title: Explicit and Implicit Large Language Model Personas Generate Opinions but Fail to Replicate Deeper Perceptions and BiasesSalvatore Giorgi, Tingting Liu, Ankit Aich, Kelsey Isman, Garrick Sherman, Zachary Fried, João Sedoc, Lyle H. Ungar, Brenda CurtisSubjects: Computation and Language (cs.CL)
Large language models (LLMs) are increasingly being used in human-centered social scientific tasks, such as data annotation, synthetic data creation, and engaging in dialog. However, these tasks are highly subjective and dependent on human factors, such as one's environment, attitudes, beliefs, and lived experiences. Thus, employing LLMs (which do not have such human factors) in these tasks may result in a lack of variation in data, failing to reflect the diversity of human experiences. In this paper, we examine the role of prompting LLMs with human-like personas and asking the models to answer as if they were a specific human. This is done explicitly, with exact demographics, political beliefs, and lived experiences, or implicitly via names prevalent in specific populations. The LLM personas are then evaluated via (1) subjective annotation task (e.g., detecting toxicity) and (2) a belief generation task, where both tasks are known to vary across human factors. We examine the impact of explicit vs. implicit personas and investigate which human factors LLMs recognize and respond to. Results show that LLM personas show mixed results when reproducing known human biases, but generate generally fail to demonstrate implicit biases. We conclude that LLMs lack the intrinsic cognitive mechanisms of human thought, while capturing the statistical patterns of how people speak, which may restrict their effectiveness in complex social science applications.
- [737] arXiv:2406.14464 [pdf, html, other]
-
Title: A Review of Common Online Speaker Diarization MethodsComments: 6 pagesSubjects: Sound (cs.SD); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
Speaker diarization provides the answer to the question "who spoke when?" for an audio file. This information can be used to complete audio transcripts for further processing steps. Most speaker diarization systems assume that the audio file is available as a whole. However, there are scenarios in which the speaker labels are needed immediately after the arrival of an audio segment. Speaker diarization with a correspondingly low latency is referred to as online speaker diarization. This paper provides an overview. First the history of online speaker diarization is briefly presented. Next a taxonomy and datasets for training and evaluation are given. In the sections that follow, online diarization methods and systems are discussed in detail. This paper concludes with the presentation of challenges that still need to be solved by future research in the field of online speaker diarization.
- [738] arXiv:2406.14469 [pdf, html, other]
-
Title: Fusion of Movement and Naive Predictions for Point Forecasting in Univariate Random WalksSubjects: Computational Engineering, Finance, and Science (cs.CE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Machine Learning (stat.ML)
Traditional methods for point forecasting in univariate random walks often fail to surpass naive benchmarks due to data unpredictability. This study introduces a novel forecasting method that fuses movement prediction (binary classification) with naive forecasts for accurate one-step-ahead point forecasting. The method's efficacy is demonstrated through theoretical analysis, simulations, and real-world data experiments. It reliably exceeds naive forecasts with movement prediction accuracies as low as 0.55, outperforming baseline models like ARIMA, linear regression, MLP, and LSTM networks in forecasting the S\&P 500 index and Bitcoin prices. This method is particularly advantageous when accurate point predictions are challenging but accurate movement predictions are attainable, translating movement predictions into point forecasts in random walk contexts.
- [739] arXiv:2406.14470 [pdf, html, other]
-
Title: Model-driven realization of IDTA submodel specifications: The good, the bad, the incompatible?Comments: 8 pages, 1 figureSubjects: Software Engineering (cs.SE)
Asset Administration Shells are trending in Industry 4.0. In February 2024, the Industrial Digital Twin Association announced 84 and released 18 AAS submodel specifications. As an enabler on programming level, dedicated APIs are needed, for which, at this level of scale, automated creation is desirable. In this paper, we present a model-driven approach, which transforms extracted information from IDTA specifications into an intermediary meta-model and, from there, generates API code and tests. We show we can process all current IDTA specifications successfully leading in total to more than 50000 lines of code. However, syntactical variations and issues in the specifications impose obstacles that require human intervention or AI support. We also discuss experiences that we made and lessons learned.
- [740] arXiv:2406.14472 [pdf, html, other]
-
Title: Self-supervised Multi-actor Social Activity Understanding in Streaming VideosComments: 16 pages, 2 figures, 4 pagesSubjects: Computer Vision and Pattern Recognition (cs.CV)
This work addresses the problem of Social Activity Recognition (SAR), a critical component in real-world tasks like surveillance and assistive robotics. Unlike traditional event understanding approaches, SAR necessitates modeling individual actors' appearance and motions and contextualizing them within their social interactions. Traditional action localization methods fall short due to their single-actor, single-action assumption. Previous SAR research has relied heavily on densely annotated data, but privacy concerns limit their applicability in real-world settings. In this work, we propose a self-supervised approach based on multi-actor predictive learning for SAR in streaming videos. Using a visual-semantic graph structure, we model social interactions, enabling relational reasoning for robust performance with minimal labeled data. The proposed framework achieves competitive performance on standard group activity recognition benchmarks. Evaluation on three publicly available action localization benchmarks demonstrates its generalizability to arbitrary action localization.
- [741] arXiv:2406.14473 [pdf, html, other]
-
Title: Data-Centric AI in the Age of Large Language ModelsXinyi Xu, Zhaoxuan Wu, Rui Qiao, Arun Verma, Yao Shu, Jingtan Wang, Xinyuan Niu, Zhenfeng He, Jiangwei Chen, Zijian Zhou, Gregory Kang Ruey Lau, Hieu Dao, Lucas Agussurja, Rachael Hwee Ling Sim, Xiaoqiang Lin, Wenyang Hu, Zhongxiang Dai, Pang Wei Koh, Bryan Kian Hsiang LowComments: PreprintSubjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
This position paper proposes a data-centric viewpoint of AI research, focusing on large language models (LLMs). We start by making the key observation that data is instrumental in the developmental (e.g., pretraining and fine-tuning) and inferential stages (e.g., in-context learning) of LLMs, and yet it receives disproportionally low attention from the research community. We identify four specific scenarios centered around data, covering data-centric benchmarks and data curation, data attribution, knowledge transfer, and inference contextualization. In each scenario, we underscore the importance of data, highlight promising research directions, and articulate the potential impacts on the research community and, where applicable, the society as a whole. For instance, we advocate for a suite of data-centric benchmarks tailored to the scale and complexity of data for LLMs. These benchmarks can be used to develop new data curation methods and document research efforts and results, which can help promote openness and transparency in AI and LLM research.
- [742] arXiv:2406.14474 [pdf, html, other]
-
Title: Spatio-temporal Patterns between ENSO and Weather-related Power Outages in the Continental United StatesSubjects: Systems and Control (eess.SY)
El Niño-Southern Oscillation (ENSO) exhibits significant impacts on the frequency of extreme weather events and its socio-economic implications prevail on a global scale. However, a fundamental gap still exists in understanding the relationship between the ENSO and weather-related power outages in the continental United States. Through 24-year (2000-2023) composite and statistical analysis, our study reveals that higher power outage numbers (PONs) are observed from the developing winter to the decaying summer of La Niña phases. In particular, during the decaying spring, high La Niña intensity favors the occurrences of power outage over the west coast and east of the United States, by modulating the frequency of extreme precipitations and heatwaves. Furthermore, projected increasing heatwaves from the Coupled Model Intercomparison Project Phase 6 (CMIP6) indicate that spring-time PONs over the eastern United States occur about 11 times higher for the mid-term future (2041-2060) and almost 26 times higher for the long-term future (2081-2100), compared with 2000-2023. Our study provides a strong recommendation for building a more climate-resilient power system.
- [743] arXiv:2406.14476 [pdf, html, other]
-
Title: Learning telic-controllable state representationsSubjects: Artificial Intelligence (cs.AI)
Computational accounts of purposeful behavior consist of descriptive and normative aspects. The former enable agents to ascertain the current (or future) state of affairs in the world and the latter to evaluate the desirability, or lack thereof, of these states with respect to the agent's goals. In Reinforcement Learning, the normative aspect (reward and value functions) is assumed to depend on a pre-defined and fixed descriptive one (state representation). Alternatively, these two aspects may emerge interdependently: goals can be, and indeed often are, expressed in terms of state representation features, but they may also serve to shape state representations themselves. Here, we illustrate a novel theoretical framing of state representation learning in bounded agents, coupling descriptive and normative aspects via the notion of goal-directed, or telic, states. We define a new controllability property of telic state representations to characterize the tradeoff between their granularity and the policy complexity capacity required to reach all telic states. We propose an algorithm for learning controllable state representations and demonstrate it using a simple navigation task with changing goals. Our framework highlights the crucial role of deliberate ignorance - knowing what to ignore - for learning state representations that are both goal-flexible and simple. More broadly, our work provides a concrete step towards a unified theoretical view of natural and artificial learning through the lens of goals.
- [744] arXiv:2406.14477 [pdf, other]
-
Title: SafeSora: Towards Safety Alignment of Text2Video Generation via a Human Preference DatasetSubjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Databases (cs.DB)
To mitigate the risk of harmful outputs from large vision models (LVMs), we introduce the SafeSora dataset to promote research on aligning text-to-video generation with human values. This dataset encompasses human preferences in text-to-video generation tasks along two primary dimensions: helpfulness and harmlessness. To capture in-depth human preferences and facilitate structured reasoning by crowdworkers, we subdivide helpfulness into 4 sub-dimensions and harmlessness into 12 sub-categories, serving as the basis for pilot annotations. The SafeSora dataset includes 14,711 unique prompts, 57,333 unique videos generated by 4 distinct LVMs, and 51,691 pairs of preference annotations labeled by humans. We further demonstrate the utility of the SafeSora dataset through several applications, including training the text-video moderation model and aligning LVMs with human preference by fine-tuning a prompt augmentation module or the diffusion model. These applications highlight its potential as the foundation for text-to-video alignment research, such as human preference modeling and the development and validation of alignment algorithms.
- [745] arXiv:2406.14478 [pdf, html, other]
-
Title: Toward data-driven research: preliminary study to predict surface roughness in material extrusion using previously published data with Machine LearningSubjects: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE)
Material extrusion is one of the most commonly used approaches within the additive manufacturing processes available. Despite its popularity and related technical advancements, process reliability and quality assurance remain only partially solved. In particular, the surface roughness caused by this process is a key concern. To solve this constraint, experimental plans have been exploited to optimize surface roughness in recent years. However, the latter empirical trial and error process is extremely time- and resource-consuming. Thus, this study aims to avoid using large experimental programs to optimize surface roughness in material extrusion.
Methodology. This research provides an in-depth analysis of the effect of several printing parameters: layer height, printing temperature, printing speed and wall thickness. The proposed data-driven predictive modeling approach takes advantage of Machine Learning models to automatically predict surface roughness based on the data gathered from the literature and the experimental data generated for testing.
Findings. Using 10-fold cross-validation of data gathered from the literature, the proposed Machine Learning solution attains a 0.93 correlation with a mean absolute percentage error of 13 %. When testing with our own data, the correlation diminishes to 0.79 and the mean absolute percentage error reduces to 8 %. Thus, the solution for predicting surface roughness in extrusion-based printing offers competitive results regarding the variability of the analyzed factors.
Originality. As available manufacturing data continue to increase on a daily basis, the ability to learn from these large volumes of data is critical in future manufacturing and science. Specifically, the power of Machine Learning helps model surface roughness with limited experimental tests. - [746] arXiv:2406.14479 [pdf, html, other]
-
Title: On Layer-wise Representation Similarity: Application for Multi-Exit Models with a Single ClassifierSubjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Analyzing the similarity of internal representations within and across different models has been an important technique for understanding the behavior of deep neural networks. Most existing methods for analyzing the similarity between representations of high dimensions, such as those based on Canonical Correlation Analysis (CCA) and widely used Centered Kernel Alignment (CKA), rely on statistical properties of the representations for a set of data points. In this paper, we focus on transformer models and study the similarity of representations between the hidden layers of individual transformers. In this context, we show that a simple sample-wise cosine similarity metric is capable of capturing the similarity and aligns with the complicated CKA. Our experimental results on common transformers reveal that representations across layers are positively correlated, albeit the similarity decreases when layers are far apart. We then propose an aligned training approach to enhance the similarity between internal representations, with trained models that enjoy the following properties: (1) the last-layer classifier can be directly applied right after any hidden layers, yielding intermediate layer accuracies much higher than those under standard training, (2) the layer-wise accuracies monotonically increase and reveal the minimal depth needed for the given task, (3) when served as multi-exit models, they achieve on-par performance with standard multi-exit architectures which consist of additional classifiers designed for early exiting in shallow layers. To our knowledge, our work is the first to show that one common classifier is sufficient for multi-exit models. We conduct experiments on both vision and NLP tasks to demonstrate the performance of the proposed aligned training.
- [747] arXiv:2406.14481 [pdf, html, other]
-
Title: Revealing Vision-Language Integration in the Brain with Multimodal NetworksVighnesh Subramaniam, Colin Conwell, Christopher Wang, Gabriel Kreiman, Boris Katz, Ignacio Cases, Andrei BarbuComments: ICML 2024; 23 pages, 11 figuresSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE); Neurons and Cognition (q-bio.NC)
We use (multi)modal deep neural networks (DNNs) to probe for sites of multimodal integration in the human brain by predicting stereoencephalography (SEEG) recordings taken while human subjects watched movies. We operationalize sites of multimodal integration as regions where a multimodal vision-language model predicts recordings better than unimodal language, unimodal vision, or linearly-integrated language-vision models. Our target DNN models span different architectures (e.g., convolutional networks and transformers) and multimodal training techniques (e.g., cross-attention and contrastive learning). As a key enabling step, we first demonstrate that trained vision and language models systematically outperform their randomly initialized counterparts in their ability to predict SEEG signals. We then compare unimodal and multimodal models against one another. Because our target DNN models often have different architectures, number of parameters, and training sets (possibly obscuring those differences attributable to integration), we carry out a controlled comparison of two models (SLIP and SimCLR), which keep all of these attributes the same aside from input modality. Using this approach, we identify a sizable number of neural sites (on average 141 out of 1090 total sites or 12.94%) and brain regions where multimodal integration seems to occur. Additionally, we find that among the variants of multimodal training techniques we assess, CLIP-style training is the best suited for downstream prediction of the neural activity in these sites.
- [748] arXiv:2406.14482 [pdf, html, other]
-
Title: Visible-Thermal Tiny Object Detection: A Benchmark Dataset and BaselinesXinyi Ying, Chao Xiao, Ruojing Li, Xu He, Boyang Li, Zhaoxu Li, Yingqian Wang, Mingyuan Hu, Qingyu Xu, Zaiping Lin, Miao Li, Shilin Zhou, Wei An, Weidong Sheng, Li LiuSubjects: Computer Vision and Pattern Recognition (cs.CV)
Small object detection (SOD) has been a longstanding yet challenging task for decades, with numerous datasets and algorithms being developed. However, they mainly focus on either visible or thermal modality, while visible-thermal (RGBT) bimodality is rarely explored. Although some RGBT datasets have been developed recently, the insufficient quantity, limited category, misaligned images and large target size cannot provide an impartial benchmark to evaluate multi-category visible-thermal small object detection (RGBT SOD) algorithms. In this paper, we build the first large-scale benchmark with high diversity for RGBT SOD (namely RGBT-Tiny), including 115 paired sequences, 93K frames and 1.2M manual annotations. RGBT-Tiny contains abundant targets (7 categories) and high-diversity scenes (8 types that cover different illumination and density variations). Note that, over 81% of targets are smaller than 16x16, and we provide paired bounding box annotations with tracking ID to offer an extremely challenging benchmark with wide-range applications, such as RGBT fusion, detection and tracking. In addition, we propose a scale adaptive fitness (SAFit) measure that exhibits high robustness on both small and large targets. The proposed SAFit can provide reasonable performance evaluation and promote detection performance. Based on the proposed RGBT-Tiny dataset and SAFit measure, extensive evaluations have been conducted, including 23 recent state-of-the-art algorithms that cover four different types (i.e., visible generic detection, visible SOD, thermal SOD and RGBT object detection). Project is available at this https URL.
- [749] arXiv:2406.14483 [pdf, html, other]
-
Title: Valid Error Bars for Neural Weather Models using Conformal PredictionVignesh Gopakumar, Joel Oskarrson, Ander Gray, Lorenzo Zanisi, Stanislas Pamela, Daniel Giles, Matt Kusner, Marc DeisenrothSubjects: Machine Learning (cs.LG)
Neural weather models have shown immense potential as inexpensive and accurate alternatives to physics-based models. However, most models trained to perform weather forecasting do not quantify the uncertainty associated with their forecasts. This limits the trust in the model and the usefulness of the forecasts. In this work we construct and formalise a conformal prediction framework as a post-processing method for estimating this uncertainty. The method is model-agnostic and gives calibrated error bounds for all variables, lead times and spatial locations. No modifications are required to the model and the computational cost is negligible compared to model training. We demonstrate the usefulness of the conformal prediction framework on a limited area neural weather model for the Nordic region. We further explore the advantages of the framework for deterministic and probabilistic models.
- [750] arXiv:2406.14485 [pdf, other]
-
Title: Proceedings of The second international workshop on eXplainable AI for the Arts (XAIxArts)Nick Bryan-Kinns, Corey Ford, Shuoyang Zheng, Helen Kennedy, Alan Chamberlain, Makayla Lewis, Drew Hemment, Zijin Li, Qiong Wu, Lanxi Xiao, Gus Xia, Jeba Rezwana, Michael Clemens, Gabriel VigliensoniSubjects: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Multimedia (cs.MM); Sound (cs.SD); Audio and Speech Processing (eess.AS)
This second international workshop on explainable AI for the Arts (XAIxArts) brought together a community of researchers in HCI, Interaction Design, AI, explainable AI (XAI), and digital arts to explore the role of XAI for the Arts. Workshop held at the 16th ACM Conference on Creativity and Cognition (C&C 2024), Chicago, USA.
- [751] arXiv:2406.14489 [pdf, other]
-
Title: A Fuzzy Logic-Based Quality Model For Identifying Microservices With Low MaintainabilityComments: 35 pages, 7 figures, 10 tables, Accepted by Journal of Systems and Software on 17.06.2024,JSSOFTWARE-D-23-01011R2Subjects: Software Engineering (cs.SE)
Microservice Architecture (MSA) is a popular architectural style that offers many advantages regarding quality attributes, including maintainability and scalability. Developing a system as a set of microservices with expected benefits requires a quality assessment strategy that is established on the measurements of the system's properties. This paper proposes a hierarchical quality model based on fuzzy logic to measure and evaluate the maintainability of MSAs considering ISO/IEC 250xy SQuaRE (System and Software Quality Requirements and Evaluation) standards. Since the qualitative bounds of low-level quality attributes are inherently ambiguous, we use a fuzzification technique to transform crisp values of code metrics into fuzzy levels and apply them as inputs to our quality model. The model generates fuzzy values for the quality sub-characteristics of the maintainability, i.e., modifiability and testability, converted to numerical values through defuzzification. In the last step, using the values of the sub-characteristics, we calculate numerical scores indicating the maintainability level of each microservice in the examined software system. This score was used to assess the quality of the microservices and decide whether they need refactoring. We evaluated our approach by creating a test set with the assistance of three developers, who reviewed and categorized the maintainability levels of the microservices in an open-source project based on their knowledge and experience. They labeled microservices as low, medium, or high, with low indicating the need for refactoring. Our method for identifying low-labeled microservices in the given test set achieved 94% accuracy, 78% precision, and 100% recall. These results indicate that our approach can assist designers in evaluating the maintainability quality of microservices.
- [752] arXiv:2406.14491 [pdf, html, other]
-
Title: Instruction Pre-Training: Language Models are Supervised Multitask LearnersSubjects: Computation and Language (cs.CL)
Unsupervised multitask pre-training has been the critical method behind the recent success of language models (LMs). However, supervised multitask learning still holds significant promise, as scaling it in the post-training stage trends towards better generalization. In this paper, we explore supervised multitask pre-training by proposing Instruction Pre-Training, a framework that scalably augments massive raw corpora with instruction-response pairs to pre-train LMs. The instruction-response pairs are generated by an efficient instruction synthesizer built on open-source models. In our experiments, we synthesize 200M instruction-response pairs covering 40+ task categories to verify the effectiveness of Instruction Pre-Training. In pre-training from scratch, Instruction Pre-Training not only consistently enhances pre-trained base models but also benefits more from further instruction tuning. In continual pre-training, Instruction Pre-Training enables Llama3-8B to be comparable to or even outperform Llama3-70B. Our model, code, and data are available at this https URL.
- [753] arXiv:2406.14492 [pdf, html, other]
-
Title: Does Object Grounding Really Reduce Hallucination of Large Vision-Language Models?Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Large vision-language models (LVLMs) have recently dramatically pushed the state of the art in image captioning and many image understanding tasks (e.g., visual question answering). LVLMs, however, often \textit{hallucinate} and produce captions that mention concepts that cannot be found in the image. These hallucinations erode the trustworthiness of LVLMs and are arguably among the main obstacles to their ubiquitous adoption. Recent work suggests that addition of grounding objectives -- those that explicitly align image regions or objects to text spans -- reduces the amount of LVLM hallucination. Although intuitive, this claim is not empirically justified as the reduction effects have been established, we argue, with flawed evaluation protocols that (i) rely on data (i.e., MSCOCO) that has been extensively used in LVLM training and (ii) measure hallucination via question answering rather than open-ended caption generation. In this work, in contrast, we offer the first systematic analysis of the effect of fine-grained object grounding on LVLM hallucination under an evaluation protocol that more realistically captures LVLM hallucination in open generation. Our extensive experiments over three backbone LLMs reveal that grounding objectives have little to no effect on object hallucination in open caption generation.
- [754] arXiv:2406.14494 [pdf, other]
-
Title: Teaching Software Metrology: The Science of Measurement for Software EngineeringSubjects: Software Engineering (cs.SE)
While the methodological rigor of computing research has improved considerably in the past two decades, quantitative software engineering research is hampered by immature measures and inattention to theory. Measurement-the principled assignment of numbers to phenomena-is intrinsically difficult because observation is predicated upon not only theoretical concepts but also the values and perspective of the research. Despite several previous attempts to raise awareness of more sophisticated approaches to measurement and the importance of quantitatively assessing reliability and validity, measurement issues continue to be widely ignored. The reasons are unknown, but differences in typical engineering and computer science graduate training programs (compared to psychology and management, for example) are involved. This chapter therefore reviews key concepts in the science of measurement and applies them to software engineering research. A series of exercises for applying important measurement concepts to the reader's research are included, and a sample dataset for the reader to try some of the statistical procedures mentioned is provided.
- [755] arXiv:2406.14495 [pdf, html, other]
-
Title: rKAN: Rational Kolmogorov-Arnold NetworksComments: The implementations are available at this https URLSubjects: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Numerical Analysis (math.NA)
The development of Kolmogorov-Arnold networks (KANs) marks a significant shift from traditional multi-layer perceptrons in deep learning. Initially, KANs employed B-spline curves as their primary basis function, but their inherent complexity posed implementation challenges. Consequently, researchers have explored alternative basis functions such as Wavelets, Polynomials, and Fractional functions. In this research, we explore the use of rational functions as a novel basis function for KANs. We propose two different approaches based on Pade approximation and rational Jacobi functions as trainable basis functions, establishing the rational KAN (rKAN). We then evaluate rKAN's performance in various deep learning and physics-informed tasks to demonstrate its practicality and effectiveness in function approximation.
- [756] arXiv:2406.14496 [pdf, html, other]
-
Title: African or European Swallow? Benchmarking Large Vision-Language Models for Fine-Grained Object ClassificationSubjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Recent Large Vision-Language Models (LVLMs) demonstrate impressive abilities on numerous image understanding and reasoning tasks. The task of fine-grained object classification (e.g., distinction between \textit{animal species}), however, has been probed insufficiently, despite its downstream importance. We fill this evaluation gap by creating \texttt{FOCI} (\textbf{F}ine-grained \textbf{O}bject \textbf{C}lass\textbf{I}fication), a difficult multiple-choice benchmark for fine-grained object classification, from existing object classification datasets: (1) multiple-choice avoids ambiguous answers associated with casting classification as open-ended QA task; (2) we retain classification difficulty by mining negative labels with a CLIP model. \texttt{FOCI}\xspace complements five popular classification datasets with four domain-specific subsets from ImageNet-21k. We benchmark 12 public LVLMs on \texttt{FOCI} and show that it tests for a \textit{complementary skill} to established image understanding and reasoning benchmarks. Crucially, CLIP models exhibit dramatically better performance than LVLMs. Since the image encoders of LVLMs come from these CLIP models, this points to inadequate alignment for fine-grained object distinction between the encoder and the LLM and warrants (pre)training data with more fine-grained annotation. We release our code at \url{this https URL}.
- [757] arXiv:2406.14497 [pdf, html, other]
-
Title: CodeRAG-Bench: Can Retrieval Augment Code Generation?Zora Zhiruo Wang, Akari Asai, Xinyan Velocity Yu, Frank F. Xu, Yiqing Xie, Graham Neubig, Daniel FriedSubjects: Software Engineering (cs.SE); Computation and Language (cs.CL)
While language models (LMs) have proven remarkably adept at generating code, many programs are challenging for LMs to generate using their parametric knowledge alone. Providing external contexts such as library documentation can facilitate generating accurate and functional code. Despite the success of retrieval-augmented generation (RAG) in various text-oriented tasks, its potential for improving code generation remains under-explored. In this work, we conduct a systematic, large-scale analysis by asking: in what scenarios can retrieval benefit code generation models? and what challenges remain? We first curate a comprehensive evaluation benchmark, CodeRAG-Bench, encompassing three categories of code generation tasks, including basic programming, open-domain, and repository-level problems. We aggregate documents from five sources for models to retrieve contexts: competition solutions, online tutorials, library documentation, StackOverflow posts, and GitHub repositories. We examine top-performing models on CodeRAG-Bench by providing contexts retrieved from one or multiple sources. While notable gains are made in final code generation by retrieving high-quality contexts across various settings, our analysis reveals room for improvement -- current retrievers still struggle to fetch useful contexts especially with limited lexical overlap, and generators fail to improve with limited context lengths or abilities to integrate additional contexts. We hope CodeRAG-Bench serves as an effective testbed to encourage further development of advanced code-oriented RAG methods.
- [758] arXiv:2406.14498 [pdf, html, other]
-
Title: LLaSA: Large Multimodal Agent for Human Activity Analysis Through Wearable SensorsComments: Under review at ARR (for EMNLP 2024)Subjects: Computation and Language (cs.CL)
Integrating inertial measurement units (IMUs) with large language models (LLMs) advances multimodal AI by enhancing human activity understanding. We introduce SensorCaps, a dataset of 26,288 IMU-derived activity narrations, and OpenSQA, an instruction-following dataset with 257,562 question-answer pairs. Combining LIMU-BERT and Llama, we develop LLaSA, a Large Multimodal Agent capable of interpreting and responding to activity and motion analysis queries. Our evaluation demonstrates LLaSA's effectiveness in activity classification and question answering, highlighting its potential in healthcare, sports science, and human-computer interaction. These contributions advance sensor-aware language models and open new research avenues. Our code repository and datasets can be found on this https URL.
- [759] arXiv:2406.14500 [pdf, html, other]
-
Title: Improving Expert Radiology Report Summarization by Prompting Large Language Models with a Layperson SummarySubjects: Computation and Language (cs.CL)
Radiology report summarization (RRS) is crucial for patient care, requiring concise "Impressions" from detailed "Findings." This paper introduces a novel prompting strategy to enhance RRS by first generating a layperson summary. This approach normalizes key observations and simplifies complex information using non-expert communication techniques inspired by doctor-patient interactions. Combined with few-shot in-context learning, this method improves the model's ability to link general terms to specific findings. We evaluate this approach on the MIMIC-CXR, CheXpert, and MIMIC-III datasets, benchmarking it against 7B/8B parameter state-of-the-art open-source large language models (LLMs) like Meta-Llama-3-8B-Instruct. Our results demonstrate improvements in summarization accuracy and accessibility, particularly in out-of-domain tests, with improvements as high as 5% for some metrics.
- [760] arXiv:2406.14503 [pdf, html, other]
-
Title: Overview of the CAIL 2023 Argument Mining TrackJingcong Liang, Junlong Wang, Xinyu Zhai, Yungui Zhuang, Yiyang Zheng, Xin Xu, Xiandong Ran, Xiaozheng Dong, Honghui Rong, Yanlun Liu, Hao Chen, Yuhan Wei, Donghai Li, Jiajie Peng, Xuanjing Huang, Chongde Shi, Yansong Feng, Yun Song, Zhongyu WeiSubjects: Computation and Language (cs.CL)
We give a detailed overview of the CAIL 2023 Argument Mining Track, one of the Chinese AI and Law Challenge (CAIL) 2023 tracks. The main goal of the track is to identify and extract interacting argument pairs in trial dialogs. It mainly uses summarized judgment documents but can also refer to trial recordings. The track consists of two stages, and we introduce the tasks designed for each stage; we also extend the data from previous events into a new dataset -- CAIL2023-ArgMine -- with annotated new cases from various causes of action. We outline several submissions that achieve the best results, including their methods for different stages. While all submissions rely on language models, they have incorporated strategies that may benefit future work in this field.
- [761] arXiv:2406.14504 [pdf, html, other]
-
Title: Translating Across Cultures: LLMs for Intralingual Cultural AdaptationSubjects: Computation and Language (cs.CL)
LLMs are increasingly being deployed for multilingual applications and have demonstrated impressive translation capabilities between several low and high resource languages. An aspect of translation that often gets overlooked is that of cultural adaptation, or modifying source culture references to suit the target culture. Cultural adaptation has applications across several creative industries and requires intimate knowledge of source and target cultures during translation. While specialized translation models still outperform LLMs on the machine translation task when viewed from the lens of correctness, they are not sensitive to cultural differences often requiring manual correction. LLMs on the other hand have a rich reservoir of cultural knowledge embedded within its parameters that can be potentially exploited for such applications. In this paper we define the task of cultural adaptation and create an evaluation framework to benchmark different models for this task. We evaluate the performance of modern LLMs for cultural adaptation and analyze their cross cultural knowledge while connecting related concepts across different cultures. We also analyze possible issues with automatic adaptation including cultural biases and stereotypes. We hope that this task will offer more insight into the cultural understanding of LLMs and their creativity in cross-cultural scenarios.
- [762] arXiv:2406.14506 [pdf, html, other]
-
Title: Online Matching and Contention Resolution for Edge Arrivals with Vanishing ProbabilitiesJournal-ref: In EC 2024Subjects: Data Structures and Algorithms (cs.DS); Discrete Mathematics (cs.DM); Combinatorics (math.CO)
We study the performance of sequential contention resolution and matching algorithms on random graphs with vanishing edge probabilities. When the edges of the graph are processed in an adversarially-chosen order, we derive a new OCRS that is $0.382$-selectable, attaining the "independence benchmark" from the literature under the vanishing edge probabilities assumption. Complementary to this positive result, we show that no OCRS can be more than $0.390$-selectable, significantly improving upon the upper bound of $0.428$ from the literature. We also derive negative results that are specialized to bipartite graphs or subfamilies of OCRS's. Meanwhile, when the edges of the graph are processed in a uniformly random order, we show that the simple greedy contention resolution scheme which accepts all active and feasible edges is $1/2$-selectable. This result is tight due to a known upper bound. Finally, when the algorithm can choose the processing order, we show that a slight tweak to the random order -- give each vertex a random priority and process edges in lexicographic order -- results in a strictly better contention resolution scheme that is $1-\ln(2-1/e)\approx0.510$-selectable. Our positive results also apply to online matching on $1$-uniform random graphs with vanishing (non-identical) edge probabilities, extending and unifying some results from the random graphs literature.
- [763] arXiv:2406.14507 [pdf, html, other]
-
Title: On Newton's Method to Unlearn Neural NetworksSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Machine unlearning facilitates personal data ownership, including the ``right to be forgotten''. The proliferation of applications of \emph{neural networks} (NNs) trained on users' personal data calls for the need to develop algorithms to unlearn an NN. Since retraining is costly, efficiency is often achieved through approximate unlearning which aims to unlearn a trained NN to be close to the retrained one (in distribution). Though the Newton's method has been used by previous works to approximately unlearn linear models, adapting it for unlearning an NN often encounters degenerate Hessians that make computing the Newton's update impossible. In this paper, we will first show that when coupled with naive yet often effective solutions to mitigate the degeneracy issue for unlearning, the Newton's method surprisingly suffers from catastrophic forgetting. To overcome this difficulty, we revise the Newton's method to include a theoretically justified regularizer and propose a cubic-regularized Newton's method for unlearning an NN. The cubic regularizer comes with the benefits of not requiring manual finetuning and affording a natural interpretation. Empirical evaluation on several models and real-world datasets shows that our method is more resilient to catastrophic forgetting and performs better than the baselines, especially in sequential unlearning.
- [764] arXiv:2406.14508 [pdf, html, other]
-
Title: Evidence of a log scaling law for political persuasion with large language modelsComments: 16 pages, 4 figuresSubjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
Large language models can now generate political messages as persuasive as those written by humans, raising concerns about how far this persuasiveness may continue to increase with model size. Here, we generate 720 persuasive messages on 10 U.S. political issues from 24 language models spanning several orders of magnitude in size. We then deploy these messages in a large-scale randomized survey experiment (N = 25,982) to estimate the persuasive capability of each model. Our findings are twofold. First, we find evidence of a log scaling law: model persuasiveness is characterized by sharply diminishing returns, such that current frontier models are barely more persuasive than models smaller in size by an order of magnitude or more. Second, mere task completion (coherence, staying on topic) appears to account for larger models' persuasive advantage. These findings suggest that further scaling model size will not much increase the persuasiveness of static LLM-generated messages.
- [765] arXiv:2406.14510 [pdf, html, other]
-
Title: V-LASIK: Consistent Glasses-Removal from Videos Using Synthetic DataSubjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR)
Diffusion-based generative models have recently shown remarkable image and video editing capabilities. However, local video editing, particularly removal of small attributes like glasses, remains a challenge. Existing methods either alter the videos excessively, generate unrealistic artifacts, or fail to perform the requested edit consistently throughout the video. In this work, we focus on consistent and identity-preserving removal of glasses in videos, using it as a case study for consistent local attribute removal in videos. Due to the lack of paired data, we adopt a weakly supervised approach and generate synthetic imperfect data, using an adjusted pretrained diffusion model. We show that despite data imperfection, by learning from our generated data and leveraging the prior of pretrained diffusion models, our model is able to perform the desired edit consistently while preserving the original video content. Furthermore, we exemplify the generalization ability of our method to other local video editing tasks by applying it successfully to facial sticker-removal. Our approach demonstrates significant improvement over existing methods, showcasing the potential of leveraging synthetic data and strong video priors for local video editing tasks.
- [766] arXiv:2406.14511 [pdf, html, other]
-
Title: Investigating Mysteries of CoT-Augmented DistillationComments: Draft; under reviewSubjects: Computation and Language (cs.CL)
Eliciting "chain of thought" (CoT) rationales -- sequences of token that convey a "reasoning" process -- has been shown to consistently improve LLM performance on tasks like question answering. More recent efforts have shown that such rationales can also be used for model distillation: Including CoT sequences (elicited from a large "teacher" model) in addition to target labels when fine-tuning a small student model yields (often substantial) improvements. In this work we ask: Why and how does this additional training signal help in model distillation? We perform ablations to interrogate this, and report some potentially surprising results. Specifically: (1) Placing CoT sequences after labels (rather than before) realizes consistently better downstream performance -- this means that no student "reasoning" is necessary at test time to realize gains. (2) When rationales are appended in this way, they need not be coherent reasoning sequences to yield improvements; performance increases are robust to permutations of CoT tokens, for example. In fact, (3) a small number of key tokens are sufficient to achieve improvements equivalent to those observed when full rationales are used in model distillation.
- [767] arXiv:2406.14514 [pdf, html, other]
-
Title: Solving a Stackelberg Game on Transportation Networks in a Dynamic Crime Scenario: A Mixed Approach on Multi-Layer NetworksSubjects: Artificial Intelligence (cs.AI)
Interdicting a criminal with limited police resources is a challenging task as the criminal changes location over time. The size of the large transportation network further adds to the difficulty of this scenario. To tackle this issue, we consider the concept of a layered graph. At each time stamp, we create a copy of the entire transportation network to track the possible movements of both players, the attacker and the defenders. We consider a Stackelberg game in a dynamic crime scenario where the attacker changes location over time while the defenders attempt to interdict the attacker on his escape route. Given a set of defender strategies, the optimal attacker strategy is determined by applying Dijkstra's algorithm on the layered networks. Here, the attacker aims to minimize while the defenders aim to maximize the probability of interdiction. We develop an approximation algorithm on the layered networks to find near-optimal strategy for defenders. The efficacy of the developed approach is compared with the adopted MILP approach. We compare the results in terms of computational time and solution quality. The quality of the results demonstrates the need for the developed approach, as it effectively solves the complex problem within a short amount of time.
- [768] arXiv:2406.14515 [pdf, html, other]
-
Title: MMBench-Video: A Long-Form Multi-Shot Benchmark for Holistic Video UnderstandingSubjects: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
The advent of large vision-language models (LVLMs) has spurred research into their applications in multi-modal contexts, particularly in video understanding. Traditional VideoQA benchmarks, despite providing quantitative metrics, often fail to encompass the full spectrum of video content and inadequately assess models' temporal comprehension. To address these limitations, we introduce MMBench-Video, a quantitative benchmark designed to rigorously evaluate LVLMs' proficiency in video understanding. MMBench-Video incorporates lengthy videos from YouTube and employs free-form questions, mirroring practical use cases. The benchmark is meticulously crafted to probe the models' temporal reasoning skills, with all questions human-annotated according to a carefully constructed ability taxonomy. We employ GPT-4 for automated assessment, demonstrating superior accuracy and robustness over earlier LLM-based evaluations. Utilizing MMBench-Video, we have conducted comprehensive evaluations that include both proprietary and open-source LVLMs for images and videos. MMBench-Video stands as a valuable resource for the research community, facilitating improved evaluation of LVLMs and catalyzing progress in the field of video understanding. The evalutation code of MMBench-Video will be integrated into VLMEvalKit: this https URL.
- [769] arXiv:2406.14517 [pdf, html, other]
-
Title: PostMark: A Robust Blackbox Watermark for Large Language ModelsComments: preprint; 18 pages, 5 figuresSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Cryptography and Security (cs.CR)
The most effective techniques to detect LLM-generated text rely on inserting a detectable signature -- or watermark -- during the model's decoding process. Most existing watermarking methods require access to the underlying LLM's logits, which LLM API providers are loath to share due to fears of model distillation. As such, these watermarks must be implemented independently by each LLM provider. In this paper, we develop PostMark, a modular post-hoc watermarking procedure in which an input-dependent set of words (determined via a semantic embedding) is inserted into the text after the decoding process has completed. Critically, PostMark does not require logit access, which means it can be implemented by a third party. We also show that PostMark is more robust to paraphrasing attacks than existing watermarking methods: our experiments cover eight baseline algorithms, five base LLMs, and three datasets. Finally, we evaluate the impact of PostMark on text quality using both automated and human assessments, highlighting the trade-off between quality and robustness to paraphrasing. We release our code, outputs, and annotations at this https URL.
- [770] arXiv:2406.14525 [pdf, html, other]
-
Title: Towards evolution of Deep Neural Networks through contrastive Self-Supervised learningComments: IEEE World Congress on Computational Intelligence (WCCI) 2024; Keywords: NeuroEvolution, Deep Learning, Evolutionary Machine LearningSubjects: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Deep Neural Networks (DNNs) have been successfully applied to a wide range of problems. However, two main limitations are commonly pointed out. The first one is that they require long time to design. The other is that they heavily rely on labelled data, which can sometimes be costly and hard to obtain. In order to address the first problem, neuroevolution has been proved to be a plausible option to automate the design of DNNs. As for the second problem, self-supervised learning has been used to leverage unlabelled data to learn representations. Our goal is to study how neuroevolution can help self-supervised learning to bridge the gap to supervised learning in terms of performance. In this work, we propose a framework that is able to evolve deep neural networks using self-supervised learning. Our results on the CIFAR-10 dataset show that it is possible to evolve adequate neural networks while reducing the reliance on labelled data. Moreover, an analysis to the structure of the evolved networks suggests that the amount of labelled data fed to them has less effect on the structure of networks that learned via self-supervised learning, when compared to individuals that relied on supervised learning.
- [771] arXiv:2406.14526 [pdf, html, other]
-
Title: Fantastic Copyrighted Beasts and How (Not) to Generate ThemLuxi He, Yangsibo Huang, Weijia Shi, Tinghao Xie, Haotian Liu, Yue Wang, Luke Zettlemoyer, Chiyuan Zhang, Danqi Chen, Peter HendersonSubjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)
Recent studies show that image and video generation models can be prompted to reproduce copyrighted content from their training data, raising serious legal concerns around copyright infringement. Copyrighted characters, in particular, pose a difficult challenge for image generation services, with at least one lawsuit already awarding damages based on the generation of these characters. Yet, little research has empirically examined this issue. We conduct a systematic evaluation to fill this gap. First, we build CopyCat, an evaluation suite consisting of diverse copyrighted characters and a novel evaluation pipeline. Our evaluation considers both the detection of similarity to copyrighted characters and generated image's consistency with user input. Our evaluation systematically shows that both image and video generation models can still generate characters even if characters' names are not explicitly mentioned in the prompt, sometimes with only two generic keywords (e.g., prompting with "videogame, plumber" consistently generates Nintendo's Mario character). We then introduce techniques to semi-automatically identify such keywords or descriptions that trigger character generation. Using our evaluation suite, we study runtime mitigation strategies, including both existing methods and new strategies we propose. Our findings reveal that commonly employed strategies, such as prompt rewriting in the DALL-E system, are not sufficient as standalone guardrails. These strategies must be coupled with other approaches, like negative prompting, to effectively reduce the unintended generation of copyrighted characters. Our work provides empirical grounding to the discussion of copyright mitigation strategies and offers actionable insights for model deployers actively implementing them.
- [772] arXiv:2406.14528 [pdf, html, other]
-
Title: DeciMamba: Exploring the Length Extrapolation Potential of MambaAssaf Ben-Kish, Itamar Zimerman, Shady Abu-Hussein, Nadav Cohen, Amir Globerson, Lior Wolf, Raja GiryesComments: Link To Official Implementation: this https URLSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Long-range sequence processing poses a significant challenge for Transformers due to their quadratic complexity in input length. A promising alternative is Mamba, which demonstrates high performance and achieves Transformer-level capabilities while requiring substantially fewer computational resources. In this paper we explore the length-generalization capabilities of Mamba, which we find to be relatively limited. Through a series of visualizations and analyses we identify that the limitations arise from a restricted effective receptive field, dictated by the sequence length used during training. To address this constraint, we introduce DeciMamba, a context-extension method specifically designed for Mamba. This mechanism, built on top of a hidden filtering mechanism embedded within the S6 layer, enables the trained model to extrapolate well even without additional training. Empirical experiments over real-world long-range NLP tasks show that DeciMamba can extrapolate to context lengths that are 25x times longer than the ones seen during training, and does so without utilizing additional computational resources. We will release our code and models.
- [773] arXiv:2406.14529 [pdf, html, other]
-
Title: A Benchmarking Study of Kolmogorov-Arnold Networks on Tabular DataSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Kolmogorov-Arnold Networks (KANs) have very recently been introduced into the world of machine learning, quickly capturing the attention of the entire community. However, KANs have mostly been tested for approximating complex functions or processing synthetic data, while a test on real-world tabular datasets is currently lacking. In this paper, we present a benchmarking study comparing KANs and Multi-Layer Perceptrons (MLPs) on tabular datasets. The study evaluates task performance and training times. From the results obtained on the various datasets, KANs demonstrate superior or comparable accuracy and F1 scores, excelling particularly in datasets with numerous instances, suggesting robust handling of complex data. We also highlight that this performance improvement of KANs comes with a higher computational cost when compared to MLPs of comparable sizes.
- [774] arXiv:2406.14532 [pdf, html, other]
-
Title: RL on Incorrect Synthetic Data Scales the Efficiency of LLM Math Reasoning by Eight-FoldSubjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Training on model-generated synthetic data is a promising approach for finetuning LLMs, but it remains unclear when it helps or hurts. In this paper, we investigate this question for math reasoning via an empirical study, followed by building a conceptual understanding of our observations. First, we find that while the typical approach of finetuning a model on synthetic correct or positive problem-solution pairs generated by capable models offers modest performance gains, sampling more correct solutions from the finetuned learner itself followed by subsequent fine-tuning on this self-generated data $\textbf{doubles}$ the efficiency of the same synthetic problems. At the same time, training on model-generated positives can amplify various spurious correlations, resulting in flat or even inverse scaling trends as the amount of data increases. Surprisingly, we find that several of these issues can be addressed if we also utilize negative responses, i.e., model-generated responses that are deemed incorrect by a final answer verifier. Crucially, these negatives must be constructed such that the training can appropriately recover the utility or advantage of each intermediate step in the negative response. With this per-step scheme, we are able to attain consistent gains over only positive data, attaining performance similar to amplifying the amount of synthetic data by $\mathbf{8 \times}$. We show that training on per-step negatives can help to unlearn spurious correlations in the positive data, and is equivalent to advantage-weighted reinforcement learning (RL), implying that it inherits robustness benefits of RL over imitating positive data alone.
- [775] arXiv:2406.14537 [pdf, html, other]
-
Title: MacroHFT: Memory Augmented Context-aware Reinforcement Learning On High Frequency TradingComments: Accepted to KDD 2024Subjects: Machine Learning (cs.LG); Trading and Market Microstructure (q-fin.TR)
High-frequency trading (HFT) that executes algorithmic trading in short time scales, has recently occupied the majority of cryptocurrency market. Besides traditional quantitative trading methods, reinforcement learning (RL) has become another appealing approach for HFT due to its terrific ability of handling high-dimensional financial data and solving sophisticated sequential decision-making problems, \emph{e.g.,} hierarchical reinforcement learning (HRL) has shown its promising performance on second-level HFT by training a router to select only one sub-agent from the agent pool to execute the current transaction. However, existing RL methods for HFT still have some defects: 1) standard RL-based trading agents suffer from the overfitting issue, preventing them from making effective policy adjustments based on financial context; 2) due to the rapid changes in market conditions, investment decisions made by an individual agent are usually one-sided and highly biased, which might lead to significant loss in extreme markets. To tackle these problems, we propose a novel Memory Augmented Context-aware Reinforcement learning method On HFT, \emph{a.k.a.} MacroHFT, which consists of two training phases: 1) we first train multiple types of sub-agents with the market data decomposed according to various financial indicators, specifically market trend and volatility, where each agent owns a conditional adapter to adjust its trading policy according to market conditions; 2) then we train a hyper-agent to mix the decisions from these sub-agents and output a consistently profitable meta-policy to handle rapid market fluctuations, equipped with a memory mechanism to enhance the capability of decision-making. Extensive experiments on various cryptocurrency markets demonstrate that MacroHFT can achieve state-of-the-art performance on minute-level trading tasks.
- [776] arXiv:2406.14539 [pdf, html, other]
-
Title: Invertible Consistency Distillation for Text-Guided Image Editing in Around 7 StepsComments: Project page: this https URLSubjects: Computer Vision and Pattern Recognition (cs.CV)
Diffusion distillation represents a highly promising direction for achieving faithful text-to-image generation in a few sampling steps. However, despite recent successes, existing distilled models still do not provide the full spectrum of diffusion abilities, such as real image inversion, which enables many precise image manipulation methods. This work aims to enrich distilled text-to-image diffusion models with the ability to effectively encode real images into their latent space. To this end, we introduce invertible Consistency Distillation (iCD), a generalized consistency distillation framework that facilitates both high-quality image synthesis and accurate image encoding in only 3-4 inference steps. Though the inversion problem for text-to-image diffusion models gets exacerbated by high classifier-free guidance scales, we notice that dynamic guidance significantly reduces reconstruction errors without noticeable degradation in generation performance. As a result, we demonstrate that iCD equipped with dynamic guidance may serve as a highly effective tool for zero-shot text-guided image editing, competing with more expensive state-of-the-art alternatives.
- [777] arXiv:2406.14540 [pdf, html, other]
-
Title: IRASim: Learning Interactive Real-Robot Action SimulatorsComments: Opensource, project website: this https URLSubjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Scalable robot learning in the real world is limited by the cost and safety issues of real robots. In addition, rolling out robot trajectories in the real world can be time-consuming and labor-intensive. In this paper, we propose to learn an interactive real-robot action simulator as an alternative. We introduce a novel method, IRASim, which leverages the power of generative models to generate extremely realistic videos of a robot arm that executes a given action trajectory, starting from an initial given frame. To validate the effectiveness of our method, we create a new benchmark, IRASim Benchmark, based on three real-robot datasets and perform extensive experiments on the benchmark. Results show that IRASim outperforms all the baseline methods and is more preferable in human evaluations. We hope that IRASim can serve as an effective and scalable approach to enhance robot learning in the real world. To promote research for generative real-robot action simulators, we open-source code, benchmark, and checkpoints at https: //genthis http URL.
- [778] arXiv:2406.14541 [pdf, html, other]
-
Title: Are LLMs Naturally Good at Synthetic Tabular Data Generation?Subjects: Machine Learning (cs.LG)
Large language models (LLMs) have demonstrated their prowess in generating synthetic text and images; however, their potential for generating tabular data -- arguably the most common data type in business and scientific applications -- is largely underexplored. This paper demonstrates that LLMs, used as-is, or after traditional fine-tuning, are severely inadequate as synthetic table generators. Due to the autoregressive nature of LLMs, fine-tuning with random order permutation runs counter to the importance of modeling functional dependencies, and renders LLMs unable to model conditional mixtures of distributions (key to capturing real world constraints). We showcase how LLMs can be made to overcome some of these deficiencies by making them permutation-aware.
- [779] arXiv:2406.14544 [pdf, html, other]
-
Title: Prism: A Framework for Decoupling and Assessing the Capabilities of VLMsYuxuan Qiao, Haodong Duan, Xinyu Fang, Junming Yang, Lin Chen, Songyang Zhang, Jiaqi Wang, Dahua Lin, Kai ChenSubjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Vision Language Models (VLMs) demonstrate remarkable proficiency in addressing a wide array of visual questions, which requires strong perception and reasoning faculties. Assessing these two competencies independently is crucial for model refinement, despite the inherent difficulty due to the intertwined nature of seeing and reasoning in existing VLMs. To tackle this issue, we present Prism, an innovative framework designed to disentangle the perception and reasoning processes involved in visual question solving. Prism comprises two distinct stages: a perception stage that utilizes a VLM to extract and articulate visual information in textual form, and a reasoning stage that formulates responses based on the extracted visual information using a Large Language Model (LLM). This modular design enables the systematic comparison and assessment of both proprietary and open-source VLM for their perception and reasoning strengths. Our analytical framework provides several valuable insights, underscoring Prism's potential as a cost-effective solution for vision-language tasks. By combining a streamlined VLM focused on perception with a powerful LLM tailored for reasoning, Prism achieves superior results in general vision-language tasks while substantially cutting down on training and operational expenses. Quantitative evaluations show that Prism, when configured with a vanilla 2B LLaVA and freely accessible GPT-3.5, delivers performance on par with VLMs $10 \times$ larger on the rigorous multimodal benchmark MMStar. The project is released at: this https URL.
- [780] arXiv:2406.14545 [pdf, html, other]
-
Title: Unmasking Database Vulnerabilities: Zero-Knowledge Schema Inference Attacks in Text-to-SQL SystemsSubjects: Computation and Language (cs.CL)
Relational databases are integral to modern information systems, serving as the foundation for storing, querying, and managing data efficiently and effectively. Advancements in large language modeling have led to the emergence of text-to-SQL technologies, significantly enhancing the querying and extracting of information from these databases and raising concerns about privacy and security. Our research extracts the database schema elements underlying a text-to-SQL model. Knowledge of the schema can make attacks such as SQL injection easier. By asking specially crafted questions, we have developed a zero-knowledge framework designed to probe various database schema elements without knowledge of the database itself. The text-to-SQL models then process these questions to produce an output that we use to uncover the structure of the database schema. We apply it to specialized text-to-SQL models fine-tuned on text-SQL pairs and generative language models used for SQL generation. Overall, we can reconstruct the table names with an F1 of nearly .75 for fine-tuned models and .96 for generative.
- [781] arXiv:2406.14546 [pdf, html, other]
-
Title: Connecting the Dots: LLMs can Infer and Verbalize Latent Structure from Disparate Training DataSubjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
One way to address safety risks from large language models (LLMs) is to censor dangerous knowledge from their training data. While this removes the explicit information, implicit information can remain scattered across various training documents. Could an LLM infer the censored knowledge by piecing together these implicit hints? As a step towards answering this question, we study inductive out-of-context reasoning (OOCR), a type of generalization in which LLMs infer latent information from evidence distributed across training documents and apply it to downstream tasks without in-context learning. Using a suite of five tasks, we demonstrate that frontier LLMs can perform inductive OOCR. In one experiment we finetune an LLM on a corpus consisting only of distances between an unknown city and other known cities. Remarkably, without in-context examples or Chain of Thought, the LLM can verbalize that the unknown city is Paris and use this fact to answer downstream questions. Further experiments show that LLMs trained only on individual coin flip outcomes can verbalize whether the coin is biased, and those trained only on pairs $(x,f(x))$ can articulate a definition of $f$ and compute inverses. While OOCR succeeds in a range of cases, we also show that it is unreliable, particularly for smaller LLMs learning complex structures. Overall, the ability of LLMs to "connect the dots" without explicit in-context learning poses a potential obstacle to monitoring and controlling the knowledge acquired by LLMs.
- [782] arXiv:2406.14548 [pdf, html, other]
-
Title: Consistency Models Made EasySubjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Consistency models (CMs) are an emerging class of generative models that offer faster sampling than traditional diffusion models. CMs enforce that all points along a sampling trajectory are mapped to the same initial point. But this target leads to resource-intensive training: for example, as of 2024, training a SoTA CM on CIFAR-10 takes one week on 8 GPUs. In this work, we propose an alternative scheme for training CMs, vastly improving the efficiency of building such models. Specifically, by expressing CM trajectories via a particular differential equation, we argue that diffusion models can be viewed as a special case of CMs with a specific discretization. We can thus fine-tune a consistency model starting from a pre-trained diffusion model and progressively approximate the full consistency condition to stronger degrees over the training process. Our resulting method, which we term Easy Consistency Tuning (ECT), achieves vastly improved training times while indeed improving upon the quality of previous methods: for example, ECT achieves a 2-step FID of 2.73 on CIFAR10 within 1 hour on a single A100 GPU, matching Consistency Distillation trained of hundreds of GPU hours. Owing to this computational efficiency, we investigate the scaling law of CMs under ECT, showing that they seem to obey classic power law scaling, hinting at their ability to improve efficiency and performance at larger scales. Code (this https URL) is available.
- [783] arXiv:2406.14549 [pdf, html, other]
-
Title: Uncovering Latent Memories: Assessing Data Leakage and Memorization Patterns in Large Language ModelsSubjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC)
The proliferation of large language models has revolutionized natural language processing tasks, yet it raises profound concerns regarding data privacy and security. Language models are trained on extensive corpora including potentially sensitive or proprietary information, and the risk of data leakage -- where the model response reveals pieces of such information -- remains inadequately understood. This study examines susceptibility to data leakage by quantifying the phenomenon of memorization in machine learning models, focusing on the evolution of memorization patterns over training. We investigate how the statistical characteristics of training data influence the memories encoded within the model by evaluating how repetition influences memorization. We reproduce findings that the probability of memorizing a sequence scales logarithmically with the number of times it is present in the data. Furthermore, we find that sequences which are not apparently memorized after the first encounter can be uncovered throughout the course of training even without subsequent encounters. The presence of these latent memorized sequences presents a challenge for data privacy since they may be hidden at the final checkpoint of the model. To this end, we develop a diagnostic test for uncovering these latent memorized sequences by considering their cross entropy loss.
- [784] arXiv:2406.14550 [pdf, html, other]
-
Title: GraphReader: Building Graph-based Agent to Enhance Long-Context Abilities of Large Language ModelsShilong Li, Yancheng He, Hangyu Guo, Xingyuan Bu, Ge Bai, Jie Liu, Jiaheng Liu, Xingwei Qu, Yangguang Li, Wanli Ouyang, Wenbo Su, Bo ZhengComments: The first four authors contributed equally, 27 pagesSubjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Long-context capabilities are essential for large language models (LLMs) to tackle complex and long-input tasks. Despite numerous efforts made to optimize LLMs for long contexts, challenges persist in robustly processing long inputs. In this paper, we introduce GraphReader, a graph-based agent system designed to handle long texts by structuring them into a graph and employing an agent to explore this graph autonomously. Upon receiving a question, the agent first undertakes a step-by-step analysis and devises a rational plan. It then invokes a set of predefined functions to read node content and neighbors, facilitating a coarse-to-fine exploration of the graph. Throughout the exploration, the agent continuously records new insights and reflects on current circumstances to optimize the process until it has gathered sufficient information to generate an answer. Experimental results on the LV-Eval dataset reveal that GraphReader, using a 4k context window, consistently outperforms GPT-4-128k across context lengths from 16k to 256k by a large margin. Additionally, our approach demonstrates superior performance on four challenging single-hop and multi-hop benchmarks.
- [785] arXiv:2406.14551 [pdf, html, other]
-
Title: Advancing Fine-Grained Classification by Structure and Subject Preserving AugmentationComments: Under review. Code is available at this https URLSubjects: Computer Vision and Pattern Recognition (cs.CV)
Fine-grained visual classification (FGVC) involves classifying closely related sub-classes. This task is difficult due to the subtle differences between classes and the high intra-class variance. Moreover, FGVC datasets are typically small and challenging to gather, thus highlighting a significant need for effective data augmentation. Recent advancements in text-to-image diffusion models offer new possibilities for augmenting classification datasets. While these models have been used to generate training data for classification tasks, their effectiveness in full-dataset training of FGVC models remains under-explored. Recent techniques that rely on Text2Image generation or Img2Img methods, often struggle to generate images that accurately represent the class while modifying them to a degree that significantly increases the dataset's diversity. To address these challenges, we present SaSPA: Structure and Subject Preserving Augmentation. Contrary to recent methods, our method does not use real images as guidance, thereby increasing generation flexibility and promoting greater diversity. To ensure accurate class representation, we employ conditioning mechanisms, specifically by conditioning on image edges and subject representation. We conduct extensive experiments and benchmark SaSPA against both traditional and recent generative data augmentation methods. SaSPA consistently outperforms all established baselines across multiple settings, including full dataset training, contextual bias, and few-shot classification. Additionally, our results reveal interesting patterns in using synthetic data for FGVC models; for instance, we find a relationship between the amount of real data used and the optimal proportion of synthetic data. Code is available at this https URL.
- [786] arXiv:2406.14553 [pdf, html, other]
-
Title: xCOMET-lite: Bridging the Gap Between Efficiency and Quality in Learned MT Evaluation MetricsSubjects: Computation and Language (cs.CL)
State-of-the-art trainable machine translation evaluation metrics like xCOMET achieve high correlation with human judgment but rely on large encoders (up to 10.7B parameters), making them computationally expensive and inaccessible to researchers with limited resources. To address this issue, we investigate whether the knowledge stored in these large encoders can be compressed while maintaining quality. We employ distillation, quantization, and pruning techniques to create efficient xCOMET alternatives and introduce a novel data collection pipeline for efficient black-box distillation. Our experiments show that, using quantization, xCOMET can be compressed up to three times with no quality degradation. Additionally, through distillation, we create an xCOMET-lite metric, which has only 2.6% of xCOMET-XXL parameters, but retains 92.1% of its quality. Besides, it surpasses strong small-scale metrics like COMET-22 and BLEURT-20 on the WMT22 metrics challenge dataset by 6.4%, despite using 50% fewer parameters. All code, dataset, and models are available online.
- [787] arXiv:2406.14555 [pdf, html, other]
-
Title: A Survey of Multimodal-Guided Image Editing with Text-to-Image Diffusion ModelsComments: Project Page: this https URLSubjects: Computer Vision and Pattern Recognition (cs.CV)
Image editing aims to edit the given synthetic or real image to meet the specific requirements from users. It is widely studied in recent years as a promising and challenging field of Artificial Intelligence Generative Content (AIGC). Recent significant advancement in this field is based on the development of text-to-image (T2I) diffusion models, which generate images according to text prompts. These models demonstrate remarkable generative capabilities and have become widely used tools for image editing. T2I-based image editing methods significantly enhance editing performance and offer a user-friendly interface for modifying content guided by multimodal inputs. In this survey, we provide a comprehensive review of multimodal-guided image editing techniques that leverage T2I diffusion models. First, we define the scope of image editing from a holistic perspective and detail various control signals and editing scenarios. We then propose a unified framework to formalize the editing process, categorizing it into two primary algorithm families. This framework offers a design space for users to achieve specific goals. Subsequently, we present an in-depth analysis of each component within this framework, examining the characteristics and applicable scenarios of different combinations. Given that training-based methods learn to directly map the source image to target one under user guidance, we discuss them separately, and introduce injection schemes of source image in different scenarios. Additionally, we review the application of 2D techniques to video editing, highlighting solutions for inter-frame inconsistency. Finally, we discuss open challenges in the field and suggest potential future research directions. We keep tracing related works at this https URL.
- [788] arXiv:2406.14556 [pdf, html, other]
-
Title: Asynchronous Large Language Model Enhanced Planner for Autonomous DrivingSubjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Despite real-time planners exhibiting remarkable performance in autonomous driving, the growing exploration of Large Language Models (LLMs) has opened avenues for enhancing the interpretability and controllability of motion planning. Nevertheless, LLM-based planners continue to encounter significant challenges, including elevated resource consumption and extended inference times, which pose substantial obstacles to practical deployment. In light of these challenges, we introduce AsyncDriver, a new asynchronous LLM-enhanced closed-loop framework designed to leverage scene-associated instruction features produced by LLM to guide real-time planners in making precise and controllable trajectory predictions. On one hand, our method highlights the prowess of LLMs in comprehending and reasoning with vectorized scene data and a series of routing instructions, demonstrating its effective assistance to real-time planners. On the other hand, the proposed framework decouples the inference processes of the LLM and real-time planners. By capitalizing on the asynchronous nature of their inference frequencies, our approach have successfully reduced the computational cost introduced by LLM, while maintaining comparable performance. Experiments show that our approach achieves superior closed-loop evaluation performance on nuPlan's challenging scenarios.
- [789] arXiv:2406.14557 [pdf, html, other]
-
Title: Generalized upwind summation-by-parts operators and their application to nodal discontinuous Galerkin methodsJan Glaubitz, Hendrik Ranocha, Andrew R. Winters, Michael Schlottke-Lakemper, Philipp Öffner, Gregor GassnerSubjects: Numerical Analysis (math.NA)
There is a pressing demand for robust, high-order baseline schemes for conservation laws that minimize reliance on supplementary stabilization. In this work, we respond to this demand by developing new baseline schemes within a nodal discontinuous Galerkin (DG) framework, utilizing upwind summation-by-parts (USBP) operators and flux vector splittings. To this end, we demonstrate the existence of USBP operators on arbitrary grid points and provide a straightforward procedure for their construction. Our method encompasses a broader class of USBP operators, not limited to equidistant grid points. This approach facilitates the development of novel USBP operators on Legendre--Gauss--Lobatto (LGL) points, which are suited for nodal discontinuous Galerkin (DG) methods. The resulting DG-USBP operators combine the strengths of traditional summation-by-parts (SBP) schemes with the benefits of upwind discretizations, including inherent dissipation mechanisms. Through numerical experiments, ranging from one-dimensional convergence tests to multi-dimensional curvilinear and under-resolved flow simulations, we find that DG-USBP operators, when integrated with flux vector splitting methods, foster more robust baseline schemes without excessive artificial dissipation.
- [790] arXiv:2406.14558 [pdf, html, other]
-
Title: CooHOI: Learning Cooperative Human-Object Interaction with Manipulated Object DynamicsJiawei Gao, Ziqin Wang, Zeqi Xiao, Jingbo Wang, Tai Wang, Jinkun Cao, Xiaolin Hu, Si Liu, Jifeng Dai, Jiangmiao PangSubjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Recent years have seen significant advancements in humanoid control, largely due to the availability of large-scale motion capture data and the application of reinforcement learning methodologies. However, many real-world tasks, such as moving large and heavy furniture, require multi-character collaboration. Given the scarcity of data on multi-character collaboration and the efficiency challenges associated with multi-agent learning, these tasks cannot be straightforwardly addressed using training paradigms designed for single-agent scenarios. In this paper, we introduce Cooperative Human-Object Interaction (CooHOI), a novel framework that addresses multi-character objects transporting through a two-phase learning paradigm: individual skill acquisition and subsequent transfer. Initially, a single agent learns to perform tasks using the Adversarial Motion Priors (AMP) framework. Following this, the agent learns to collaborate with others by considering the shared dynamics of the manipulated object during parallel training using Multi Agent Proximal Policy Optimization (MAPPO). When one agent interacts with the object, resulting in specific object dynamics changes, the other agents learn to respond appropriately, thereby achieving implicit communication and coordination between teammates. Unlike previous approaches that relied on tracking-based methods for multi-character HOI, CooHOI is inherently efficient, does not depend on motion capture data of multi-character interactions, and can be seamlessly extended to include more participants and a wide range of object types
- [791] arXiv:2406.14559 [pdf, html, other]
-
Title: Disentangled Representation Learning for Environment-agnostic Speaker RecognitionComments: Interspeech 2024. The official webpage can be found at this https URLSubjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
This work presents a framework based on feature disentanglement to learn speaker embeddings that are robust to environmental variations. Our framework utilises an auto-encoder as a disentangler, dividing the input speaker embedding into components related to the speaker and other residual information. We employ a group of objective functions to ensure that the auto-encoder's code representation - used as the refined embedding - condenses only the speaker characteristics. We show the versatility of our framework through its compatibility with any existing speaker embedding extractor, requiring no structural modifications or adaptations for integration. We validate the effectiveness of our framework by incorporating it into two popularly used embedding extractors and conducting experiments across various benchmarks. The results show a performance improvement of up to 16%. We release our code for this work to be available this https URL
- [792] arXiv:2406.14561 [pdf, html, other]
-
Title: How to Compute the Probability of a WordSubjects: Computation and Language (cs.CL)
Language models (LMs) estimate the probability distribution over sequences of natural language; these distributions are crucial for computing perplexity and surprisal in linguistics research. While we are usually concerned with measuring these values for words, most LMs operate over subwords. Despite seemingly straightforward, accurately computing probabilities over one unit given probabilities over the other requires care. Indeed, we show here that many recent linguistic studies have been incorrectly computing these values. This paper derives the correct methods for computing word probabilities, highlighting issues when relying on language models that use beginning-of-word (bow)-marking tokenisers, e.g., the GPT family. Empirically, we show that correcting the widespread bug in probability computations affects measured outcomes in sentence comprehension and lexical optimisation analyses.
- [793] arXiv:2406.14562 [pdf, html, other]
-
Title: Whiteboard-of-Thought: Thinking Step-by-Step Across ModalitiesComments: Project website: this http URLSubjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
When presented with questions involving visual thinking, humans naturally switch reasoning modalities, often forming mental images or drawing visual aids. Large language models have shown promising results in arithmetic and symbolic reasoning by expressing intermediate reasoning in text as a chain of thought, yet struggle to extend this capability to answer text queries that are easily solved by visual reasoning, even with extensive multimodal pretraining. We introduce a simple method, whiteboard-of-thought prompting, to unlock the visual reasoning capabilities of multimodal large language models across modalities. Whiteboard-of-thought prompting provides multimodal large language models with a metaphorical `whiteboard' to draw out reasoning steps as images, then returns these images back to the model for further processing. We find this can be accomplished with no demonstrations or specialized modules, instead leveraging models' existing ability to write code with libraries such as Matplotlib and Turtle. This simple approach shows state-of-the-art results on four difficult natural language tasks that involve visual and spatial reasoning. We identify multiple settings where GPT-4o using chain-of-thought fails dramatically, including more than one where it achieves $0\%$ accuracy, while whiteboard-of-thought enables up to $92\%$ accuracy in these same settings. We present a detailed exploration of where the technique succeeds as well as its sources of error.
- [794] arXiv:2406.14563 [pdf, html, other]
-
Title: Model Merging and Safety Alignment: One Bad Model Spoils the BunchHasan Abed Al Kader Hammoud, Umberto Michieli, Fabio Pizzati, Philip Torr, Adel Bibi, Bernard Ghanem, Mete OzayComments: Under reviewSubjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Merging Large Language Models (LLMs) is a cost-effective technique for combining multiple expert LLMs into a single versatile model, retaining the expertise of the original ones. However, current approaches often overlook the importance of safety alignment during merging, leading to highly misaligned models. This work investigates the effects of model merging on alignment. We evaluate several popular model merging techniques, demonstrating that existing methods do not only transfer domain expertise but also propagate misalignment. We propose a simple two-step approach to address this problem: (i) generating synthetic safety and domain-specific data, and (ii) incorporating these generated data into the optimization process of existing data-aware model merging techniques. This allows us to treat alignment as a skill that can be maximized in the resulting merged LLM. Our experiments illustrate the effectiveness of integrating alignment-related data during merging, resulting in models that excel in both domain expertise and alignment.
New submissions for Friday, 21 June 2024 (showing 794 of 794 entries )
- [795] arXiv:2405.13244 (cross-list from quant-ph) [pdf, html, other]
-
Title: Quantum Software Ecosystem DesignAchim Basermann, Michael Epping, Benedikt Fauseweh, Michael Felderer, Elisabeth Lobe, Melven Röhrig-Zöllner, Gary Schmiedinghoff, Peter K. Schuhmacher, Yoshinta Setyawati, Alexander WeinertSubjects: Quantum Physics (quant-ph); Software Engineering (cs.SE)
The rapid advancements in quantum computing necessitate a scientific and rigorous approach to the construction of a corresponding software ecosystem, a topic underexplored and primed for systematic investigation. This chapter takes an important step in this direction: It presents scientific considerations essential for building a quantum software ecosystem that makes quantum computing available for scientific and industrial problem solving. Central to this discourse is the concept of hardware-software co-design, which fosters a bidirectional feedback loop from the application layer at the top of the software stack down to the hardware. This approach begins with compilers and low-level software that are specifically designed to align with the unique specifications and constraints of the quantum processor, proceeds with algorithms developed with a clear understanding of underlying hardware and computational model features, and extends to applications that effectively leverage the capabilities to achieve a quantum advantage. We analyze the ecosystem from two critical perspectives: the conceptual view, focusing on theoretical foundations, and the technical infrastructure, addressing practical implementations around real quantum devices necessary for a functional ecosystem. This approach ensures that the focus is towards promising applications with optimized algorithm-circuit synergy, while ensuring a user-friendly design, an effective data management and an overall orchestration. Our chapter thus offers a guide to the essential concepts and practical strategies necessary for developing a scientifically grounded quantum software ecosystem.
- [796] arXiv:2406.12279 (cross-list from econ.TH) [pdf, html, other]
-
Title: Strategy-proof Selling: a Geometric ApproachSubjects: Theoretical Economics (econ.TH); Computer Science and Game Theory (cs.GT)
We consider one buyer and one seller. For a bundle $(t,q)\in [0,\infty[\times [0,1]=\mathbb{Z}$, $q$ either refers to the wining probability of an object or a share of a good, and $t$ denotes the payment that the buyer makes. We define classical and restricted classical preferences of the buyer on $\mathbb{Z}$; they incorporate quasilinear, non-quasilinear, risk averse preferences with multidimensional pay-off relevant parameters. We define rich single-crossing subsets of the two classes, and characterize strategy-proof mechanisms by using monotonicity of the mechanisms and continuity of the indirect preference correspondences. We also provide a computationally tractable optimization program to compute the optimal mechanism. We do not use revenue equivalence and virtual valuations as tools in our proofs. Our proof techniques bring out the geometric interaction between the single-crossing property and the positions of bundles $(t,q)$s. Our proofs are simple and provide computationally tractable optimization program to compute the optimal mechanism. The extension of the optimization program to the $n-$ buyer environment is immediate.
- [797] arXiv:2406.12881 (cross-list from physics.acc-ph) [pdf, html, other]
-
Title: Towards Unlocking Insights from Logbooks Using AIAntonin Sulc, Alex Bien, Annika Eichler, Daniel Ratner, Florian Rehm, Frank Mayet, Gregor Hartmann, Hayden Hoschouer, Henrik Tuennermann, Jan Kaiser, Jason St. John, Jennefer Maldonado, Kyle Hazelwood, Raimund Kammering, Thorsten Hellert, Tim Wilksen, Verena Kain, Wan-Lin HuComments: 5 pages, 1 figure, 15th International Particle Accelerator ConferenceSubjects: Accelerator Physics (physics.acc-ph); Computation and Language (cs.CL)
Electronic logbooks contain valuable information about activities and events concerning their associated particle accelerator facilities. However, the highly technical nature of logbook entries can hinder their usability and automation. As natural language processing (NLP) continues advancing, it offers opportunities to address various challenges that logbooks present. This work explores jointly testing a tailored Retrieval Augmented Generation (RAG) model for enhancing the usability of particle accelerator logbooks at institutes like DESY, BESSY, Fermilab, BNL, SLAC, LBNL, and CERN. The RAG model uses a corpus built on logbook contributions and aims to unlock insights from these logbooks by leveraging retrieval over facility datasets, including discussion about potential multimodal sources. Our goals are to increase the FAIR-ness (findability, accessibility, interoperability, and reusability) of logbooks by exploiting their information content to streamline everyday use, enable macro-analysis for root cause analysis, and facilitate problem-solving automation.
- [798] arXiv:2406.12888 (cross-list from cond-mat.mtrl-sci) [pdf, html, other]
-
Title: A Space Group Symmetry Informed Network for O(3) Equivariant Crystal Tensor PredictionComments: This paper has been accepted to ICML 24 as a poster. You are encouraged to cite the conference version of this paperSubjects: Materials Science (cond-mat.mtrl-sci); Artificial Intelligence (cs.AI); Atomic Physics (physics.atom-ph)
We consider the prediction of general tensor properties of crystalline materials, including dielectric, piezoelectric, and elastic tensors. A key challenge here is how to make the predictions satisfy the unique tensor equivariance to O(3) group and invariance to crystal space groups. To this end, we propose a General Materials Tensor Network (GMTNet), which is carefully designed to satisfy the required symmetries. To evaluate our method, we curate a dataset and establish evaluation metrics that are tailored to the intricacies of crystal tensor predictions. Experimental results show that our GMTNet not only achieves promising performance on crystal tensors of various orders but also generates predictions fully consistent with the intrinsic crystal symmetries. Our code is publicly available as part of the AIRS library (this https URL).
- [799] arXiv:2406.12895 (cross-list from q-bio.NC) [pdf, html, other]
-
Title: Temporal Complexity of a Hopfield-Type Neural Model in Random and Scale-Free GraphsSubjects: Neurons and Cognition (q-bio.NC); Disordered Systems and Neural Networks (cond-mat.dis-nn); Mathematical Physics (math-ph); Numerical Analysis (math.NA); Adaptation and Self-Organizing Systems (nlin.AO)
The Hopfield network model and its generalizations were introduced as a model of associative, or content-addressable, memory. They were widely investigated both as a unsupervised learning method in artificial intelligence and as a model of biological neural dynamics in computational neuroscience. The complexity features of biological neural networks are attracting the interest of scientific community since the last two decades. More recently, concepts and tools borrowed from complex network theory were applied to artificial neural networks and learning, thus focusing on the topological aspects. However, the temporal structure is also a crucial property displayed by biological neural networks and investigated in the framework of systems displaying complex intermittency. The Intermittency-Driven Complexity (IDC) approach indeed focuses on the metastability of self-organized states, whose signature is a power-decay in the inter-event time distribution or a scaling behavior in the related event-driven diffusion processes. The investigation of IDC in neural dynamics and its relationship with network topology is still in its early stages. In this work we present the preliminary results of a IDC analysis carried out on a bio-inspired Hopfield-type neural network comparing two different connectivities, i.e., scale-free vs. random network topology. We found that random networks can trigger complexity features similar to that of scale-free networks, even if with some differences and for different parameter values, in particular for different noise levels.
- [800] arXiv:2406.12898 (cross-list from physics.ins-det) [pdf, html, other]
-
Title: A Comprehensive Evaluation of Generative Models in Calorimeter Shower SimulationSubjects: Instrumentation and Detectors (physics.ins-det); Artificial Intelligence (cs.AI); High Energy Physics - Experiment (hep-ex); Data Analysis, Statistics and Probability (physics.data-an)
The pursuit of understanding fundamental particle interactions has reached unparalleled precision levels. Particle physics detectors play a crucial role in generating low-level object signatures that encode collision physics. However, simulating these particle collisions is a demanding task in terms of memory and computation which will be exasperated with larger data volumes, more complex detectors, and a higher pileup environment in the High-Luminosity LHC. The introduction of "Fast Simulation" has been pivotal in overcoming computational bottlenecks. The use of deep-generative models has sparked a surge of interest in surrogate modeling for detector simulations, generating particle showers that closely resemble the observed data. Nonetheless, there is a pressing need for a comprehensive evaluation of their performance using a standardized set of metrics. In this study, we conducted a rigorous evaluation of three generative models using standard datasets and a diverse set of metrics derived from physics, computer vision, and statistics. Furthermore, we explored the impact of using full versus mixed precision modes during inference. Our evaluation revealed that the CaloDiffusion and CaloScore generative models demonstrate the most accurate simulation of particle showers, yet there remains substantial room for improvement. Our findings identified areas where the evaluated models fell short in accurately replicating Geant4 data.
- [801] arXiv:2406.12901 (cross-list from physics.ins-det) [pdf, html, other]
-
Title: Interpretable machine learning approach for electron antineutrino selection in a large liquid scintillator detectorA. Gavrikov, V. Cerrone, A. Serafini, R. Brugnera, A. Garfagnini, M. Grassi, B. Jelmini, L. Lastrucci, S. Aiello, G. Andronico, V. Antonelli, A. Barresi, D. Basilico, M. Beretta, A. Bergnoli, M. Borghesi, A. Brigatti, R. Bruno, A. Budano, B. Caccianiga, A. Cammi, R. Caruso, D. Chiesa, C. Clementi, S. Dusini, A. Fabbri, G. Felici, F. Ferraro, M. G. Giammarchi, N. Giugice, R. M. Guizzetti, N. Guardone, C. Landini, I. Lippi, S. Loffredo, L. Loi, P. Lombardi, C. Lombardo, F. Mantovani, S. M. Mari, A. Martini, L. Miramonti, M. Montuschi, M. Nastasi, D. Orestano, F. Ortica, A. Paoloni, E. Percalli, F. Petrucci, E. Previtali, G. Ranucci, A. C. Re, M. Redchuck, B. Ricci, A. Romani, P. Saggese, G. Sava, C. Sirignano, M. Sisti, L. Stanco, E. Stanescu Farilla, V. Strati, M. D. C. Torri, A. Triossi, C. Tuvé, C. Venettacci, G. Verde, L. VotanoSubjects: Instrumentation and Detectors (physics.ins-det); Machine Learning (cs.LG); High Energy Physics - Experiment (hep-ex); Data Analysis, Statistics and Probability (physics.data-an)
Several neutrino detectors, KamLAND, Daya Bay, Double Chooz, RENO, and the forthcoming large-scale JUNO, rely on liquid scintillator to detect reactor antineutrino interactions. In this context, inverse beta decay represents the golden channel for antineutrino detection, providing a pair of correlated events, thus a strong experimental signature to distinguish the signal from a variety of backgrounds. However, given the low cross-section of antineutrino interactions, the development of a powerful event selection algorithm becomes imperative to achieve effective discrimination between signal and backgrounds. In this study, we introduce a machine learning (ML) model to achieve this goal: a fully connected neural network as a powerful signal-background discriminator for a large liquid scintillator detector. We demonstrate, using the JUNO detector as an example, that, despite the already high efficiency of a cut-based approach, the presented ML model can further improve the overall event selection efficiency. Moreover, it allows for the retention of signal events at the detector edges that would otherwise be rejected because of the overwhelming amount of background events in that region. We also present the first interpretable analysis of the ML approach for event selection in reactor neutrino experiments. This method provides insights into the decision-making process of the model and offers valuable information for improving and updating traditional event selection approaches.
- [802] arXiv:2406.12906 (cross-list from q-bio.NC) [pdf, other]
-
Title: Entropy-statistical approach to phase-locking detection of pulse oscillations: application for the analysis of biosignal synchronizationComments: 23 pages, 12 figures, 3 tablesSubjects: Neurons and Cognition (q-bio.NC); Machine Learning (cs.LG); Adaptation and Self-Organizing Systems (nlin.AO)
In this study a new method for analyzing synchronization in oscillator systems is proposed using the example of modeling the dynamics of a circuit of two resistively coupled pulse oscillators. The dynamic characteristic of synchronization is fuzzy entropy (FuzzyEn) calculated a time series composed of the ratios of the number of pulse periods (subharmonic ratio, SHR) during phase-locking intervals. Low entropy values indicate strong synchronization, whereas high entropy values suggest weak synchronization between the two oscillators. This method effectively visualizes synchronized modes of the circuit using entropy maps of synchronization states. Additionally, a classification of synchronization states is proposed based on the dependencies of FuzzyEn on the length of embedding vectors of SHR time series. An extension of this method for analyzing non-relaxation (non-spike) type signals is illustrated using the example of phase-phase coupling rhythms of local field potential of rat hippocampus. The entropy-statistical approach using rational fractions and pulse signal forms makes this method promising for analyzing biosignal synchronization and implementing the algorithm in mobile digital platforms.
- [803] arXiv:2406.12931 (cross-list from eess.AS) [pdf, html, other]
-
Title: Automatic Speech Recognition for Biomedical Data in Bengali LanguageSubjects: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Sound (cs.SD)
This paper presents the development of a prototype Automatic Speech Recognition (ASR) system specifically designed for Bengali biomedical data. Recent advancements in Bengali ASR are encouraging, but a lack of domain-specific data limits the creation of practical healthcare ASR models. This project bridges this gap by developing an ASR system tailored for Bengali medical terms like symptoms, severity levels, and diseases, encompassing two major dialects: Bengali and Sylheti. We train and evaluate two popular ASR frameworks on a comprehensive 46-hour Bengali medical corpus. Our core objective is to create deployable health-domain ASR systems for digital health applications, ultimately increasing accessibility for non-technical users in the healthcare sector.
- [804] arXiv:2406.12937 (cross-list from eess.AS) [pdf, html, other]
-
Title: Self-Train Before You TranscribeComments: Accepted at Interspeech 2024Subjects: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD)
When there is a mismatch between the training and test domains, current speech recognition systems show significant performance degradation. Self-training methods, such as noisy student teacher training, can help address this and enable the adaptation of models under such domain shifts. However, self-training typically requires a collection of unlabelled target domain data. For settings where this is not practical, we investigate the benefit of performing noisy student teacher training on recordings in the test set as a test-time adaptation approach. Similarly to the dynamic evaluation approach in language modelling, this enables the transfer of information across utterance boundaries and functions as a method of domain adaptation. A range of in-domain and out-of-domain datasets are used for experiments demonstrating large relative gains of up to 32.2%. Interestingly, our method showed larger gains than the typical self-training setup that utilises separate adaptation data.
- [805] arXiv:2406.12946 (cross-list from eess.AS) [pdf, other]
-
Title: Instruction Data Generation and Unsupervised Adaptation for Speech Language ModelsComments: Accepted for Interspeech 2024Subjects: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
In this paper, we propose three methods for generating synthetic samples to train and evaluate multimodal large language models capable of processing both text and speech inputs. Addressing the scarcity of samples containing both modalities, synthetic data generation emerges as a crucial strategy to enhance the performance of such systems and facilitate the modeling of cross-modal relationships between the speech and text domains. Our process employs large language models to generate textual components and text-to-speech systems to generate speech components. The proposed methods offer a practical and effective means to expand the training dataset for these models. Experimental results show progress in achieving an integrated understanding of text and speech. We also highlight the potential of using unlabeled speech data to generate synthetic samples comparable in quality to those with available transcriptions, enabling the expansion of these models to more languages.
- [806] arXiv:2406.12950 (cross-list from q-bio.QM) [pdf, html, other]
-
Title: MolecularGPT: Open Large Language Model (LLM) for Few-Shot Molecular Property PredictionSubjects: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Computation and Language (cs.CL); Machine Learning (cs.LG)
Molecular property prediction (MPP) is a fundamental and crucial task in drug discovery. However, prior methods are limited by the requirement for a large number of labeled molecules and their restricted ability to generalize for unseen and new tasks, both of which are essential for real-world applications. To address these challenges, we present MolecularGPT for few-shot MPP. From a perspective on instruction tuning, we fine-tune large language models (LLMs) based on curated molecular instructions spanning over 1000 property prediction tasks. This enables building a versatile and specialized LLM that can be adapted to novel MPP tasks without any fine-tuning through zero- and few-shot in-context learning (ICL). MolecularGPT exhibits competitive in-context reasoning capabilities across 10 downstream evaluation datasets, setting new benchmarks for few-shot molecular prediction tasks. More importantly, with just two-shot examples, MolecularGPT can outperform standard supervised graph neural network methods on 4 out of 7 datasets. It also excels state-of-the-art LLM baselines by up to 16.6% increase on classification accuracy and decrease of 199.17 on regression metrics (e.g., RMSE) under zero-shot. This study demonstrates the potential of LLMs as effective few-shot molecular property predictors. The code is available at this https URL.
- [807] arXiv:2406.12983 (cross-list from q-fin.CP) [pdf, other]
-
Title: Reinforcement Learning for Corporate Bond Trading: A Sell Side PerspectiveComments: Working PaperSubjects: Computational Finance (q-fin.CP); Machine Learning (cs.LG); Optimization and Control (math.OC)
A corporate bond trader in a typical sell side institution such as a bank provides liquidity to the market participants by buying/selling securities and maintaining an inventory. Upon receiving a request for a buy/sell price quote (RFQ), the trader provides a quote by adding a spread over a \textit{prevalent market price}. For illiquid bonds, the market price is harder to observe, and traders often resort to available benchmark bond prices (such as MarketAxess, Bloomberg, etc.). In \cite{Bergault2023ModelingLI}, the concept of \textit{Fair Transfer Price} for an illiquid corporate bond was introduced which is derived from an infinite horizon stochastic optimal control problem (for maximizing the trader's expected P\&L, regularized by the quadratic variation). In this paper, we consider the same optimization objective, however, we approach the estimation of an optimal bid-ask spread quoting strategy in a data driven manner and show that it can be learned using Reinforcement Learning. Furthermore, we perform extensive outcome analysis to examine the reasonableness of the trained agent's behavior.
- [808] arXiv:2406.12998 (cross-list from eess.AS) [pdf, html, other]
-
Title: Articulatory Encodec: Vocal Tract Kinematics as a Codec for SpeechSubjects: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Sound (cs.SD)
Vocal tract articulation is a natural, grounded control space of speech production. The spatiotemporal coordination of articulators combined with the vocal source shapes intelligible speech sounds to enable effective spoken communication. Based on this physiological grounding of speech, we propose a new framework of neural encoding-decoding of speech -- articulatory encodec. The articulatory encodec comprises an articulatory analysis model that infers articulatory features from speech audio, and an articulatory synthesis model that synthesizes speech audio from articulatory features. The articulatory features are kinematic traces of vocal tract articulators and source features, which are intuitively interpretable and controllable, being the actual physical interface of speech production. An additional speaker identity encoder is jointly trained with the articulatory synthesizer to inform the voice texture of individual speakers. By training on large-scale speech data, we achieve a fully intelligible, high-quality articulatory synthesizer that generalizes to unseen speakers. Furthermore, the speaker embedding is effectively disentangled from articulations, which enables accent-perserving zero-shot voice conversion. To the best of our knowledge, this is the first demonstration of universal, high-performance articulatory inference and synthesis, suggesting the proposed framework as a powerful coding system of speech.
- [809] arXiv:2406.13023 (cross-list from math.OC) [pdf, other]
-
Title: Stackelberg Games with $k$-Submodular Function under Distributional Risk-Receptiveness and RobustnessSubjects: Optimization and Control (math.OC); Machine Learning (cs.LG)
We study submodular optimization in adversarial context, applicable to machine learning problems such as feature selection using data susceptible to uncertainties and attacks. We focus on Stackelberg games between an attacker (or interdictor) and a defender where the attacker aims to minimize the defender's objective of maximizing a $k$-submodular function. We allow uncertainties arising from the success of attacks and inherent data noise, and address challenges due to incomplete knowledge of the probability distribution of random parameters. Specifically, we introduce Distributionally Risk-Averse $k$-Submodular Interdiction Problem (DRA $k$-SIP) and Distributionally Risk-Receptive $k$-Submodular Interdiction Problem (DRR $k$-SIP) along with finitely convergent exact algorithms for solving them. The DRA $k$-SIP solution allows risk-averse interdictor to develop robust strategies for real-world uncertainties. Conversely, DRR $k$-SIP solution suggests aggressive tactics for attackers, willing to embrace (distributional) risk to inflict maximum damage, identifying critical vulnerable components, which can be used for the defender's defensive strategies. The optimal values derived from both DRA $k$-SIP and DRR $k$-SIP offer a confidence interval-like range for the expected value of the defender's objective function, capturing distributional ambiguity. We conduct computational experiments using instances of feature selection and sensor placement problems, and Wisconsin breast cancer data and synthetic data, respectively.
- [810] arXiv:2406.13030 (cross-list from physics.comp-ph) [pdf, other]
-
Title: Point containment algorithms for constructive solid geometry with unbounded primitivesSubjects: Computational Physics (physics.comp-ph); Computational Geometry (cs.CG)
We present several algorithms for evaluating point containment in constructive solid geometry (CSG) trees with unbounded primitives. Three algorithms are presented based on postfix, prefix, and infix notations of the CSG binary expression tree. We show that prefix and infix notations enable short-circuiting logic, which reduces the number of primitives that must be checked during point containment. To evaluate the performance of the algorithms, each algorithm was implemented in the OpenMC Monte Carlo particle transport code, which relies on CSG to represent solid bodies through which subatomic particles travel. Two sets of tests were carried out. First, the execution time to generate a high-resolution rasterized image of a 2D slice of a detailed CSG model of the ITER tokamak was measured. Use of both prefix and infix notations offered significant speedup over the postfix notation that has traditionally been used in particle transport codes, with infix resulting in a 6$\times$ reduction in execution time relative to postfix. We then measured the execution time of neutron transport simulations of the same ITER model using each of the algorithms. The results and performance improvements reveal the same trends as for the rasterization test, with a 4.59$\times$ overall speedup using the infix notation relative to the original postfix notation in OpenMC.
- [811] arXiv:2406.13036 (cross-list from stat.ML) [pdf, html, other]
-
Title: Sharp detection of low-dimensional structure in probability measures via dimensional logarithmic Sobolev inequalitiesSubjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Probability (math.PR); Statistics Theory (math.ST); Computation (stat.CO)
Identifying low-dimensional structure in high-dimensional probability measures is an essential pre-processing step for efficient sampling. We introduce a method for identifying and approximating a target measure $\pi$ as a perturbation of a given reference measure $\mu$ along a few significant directions of $\mathbb{R}^{d}$. The reference measure can be a Gaussian or a nonlinear transformation of a Gaussian, as commonly arising in generative modeling. Our method extends prior work on minimizing majorizations of the Kullback--Leibler divergence to identify optimal approximations within this class of measures. Our main contribution unveils a connection between the \emph{dimensional} logarithmic Sobolev inequality (LSI) and approximations with this ansatz. Specifically, when the target and reference are both Gaussian, we show that minimizing the dimensional LSI is equivalent to minimizing the KL divergence restricted to this ansatz. For general non-Gaussian measures, the dimensional LSI produces majorants that uniformly improve on previous majorants for gradient-based dimension reduction. We further demonstrate the applicability of this analysis to the squared Hellinger distance, where analogous reasoning shows that the dimensional Poincaré inequality offers improved bounds.
- [812] arXiv:2406.13059 (cross-list from eess.IV) [pdf, html, other]
-
Title: Learned Compression of Encoding DistributionsComments: 7 pages, 5 figures, IEEE ICIP 2024Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
The entropy bottleneck introduced by Ballé et al. is a common component used in many learned compression models. It encodes a transformed latent representation using a static distribution whose parameters are learned during training. However, the actual distribution of the latent data may vary wildly across different inputs. The static distribution attempts to encompass all possible input distributions, thus fitting none of them particularly well. This unfortunate phenomenon, sometimes known as the amortization gap, results in suboptimal compression. To address this issue, we propose a method that dynamically adapts the encoding distribution to match the latent data distribution for a specific input. First, our model estimates a better encoding distribution for a given input. This distribution is then compressed and transmitted as an additional side-information bitstream. Finally, the decoder reconstructs the encoding distribution and uses it to decompress the corresponding latent data. Our method achieves a Bjøntegaard-Delta (BD)-rate gain of -7.10% on the Kodak test dataset when applied to the standard fully-factorized architecture. Furthermore, considering computational complexity, the transform used by our method is an order of magnitude cheaper in terms of Multiply-Accumulate (MAC) operations compared to related side-information methods such as the scale hyperprior.
- [813] arXiv:2406.13074 (cross-list from hep-ph) [pdf, html, other]
-
Title: PIPPIN: Generating variable length full events from partonsSubjects: High Energy Physics - Phenomenology (hep-ph); Machine Learning (cs.LG); High Energy Physics - Experiment (hep-ex); Machine Learning (stat.ML)
This paper presents a novel approach for directly generating full events at detector-level from parton-level information, leveraging cutting-edge machine learning techniques. To address the challenge of multiplicity variations between parton and reconstructed object spaces, we employ transformers, score-based models and normalizing flows. Our method tackles the inherent complexities of the stochastic transition between these two spaces and achieves remarkably accurate results. The combination of innovative techniques and the achieved accuracy demonstrates the potential of our approach in advancing the field and opens avenues for further exploration. This research contributes to the ongoing efforts in high-energy physics and generative modelling, providing a promising direction for enhanced precision in fast detector simulation.
- [814] arXiv:2406.13112 (cross-list from physics.chem-ph) [pdf, other]
-
Title: Nutmeg and SPICE: Models and Data for Biomolecular Machine LearningComments: 24 pages, 8 figures, 7 tablesSubjects: Chemical Physics (physics.chem-ph); Machine Learning (cs.LG)
We describe version 2 of the SPICE dataset, a collection of quantum chemistry calculations for training machine learning potentials. It expands on the original dataset by adding much more sampling of chemical space and more data on non-covalent interactions. We train a set of potential energy functions called Nutmeg on it. They use a novel mechanism to improve performance on charged and polar molecules, injecting precomputed partial charges into the model to provide a reference for the large scale charge distribution. Evaluation of the new models shows they do an excellent job of reproducing energy differences between conformations, even on highly charged molecules or ones that are significantly larger than the molecules in the training set. They also produce stable molecular dynamics trajectories, and are fast enough to be useful for routine simulation of small molecules.
- [815] arXiv:2406.13139 (cross-list from eess.AS) [pdf, html, other]
-
Title: Audio Fingerprinting with Holographic Reduced RepresentationsComments: accepted at Interspeech 2024Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
This paper proposes an audio fingerprinting model with holographic reduced representation (HRR). The proposed method reduces the number of stored fingerprints, whereas conventional neural audio fingerprinting requires many fingerprints for each audio track to achieve high accuracy and time resolution. We utilize HRR to aggregate multiple fingerprints into a composite fingerprint via circular convolution and summation, resulting in fewer fingerprints with the same dimensional space as the original. Our search method efficiently finds a combined fingerprint in which a query fingerprint exists. Using HRR's inverse operation, it can recover the relative position within a combined fingerprint, retaining the original time resolution. Experiments show that our method can reduce the number of fingerprints with modest accuracy degradation while maintaining the time resolution, outperforming simple decimation and summation-based aggregation methods.
- [816] arXiv:2406.13150 (cross-list from eess.IV) [pdf, other]
-
Title: MCAD: Multi-modal Conditioned Adversarial Diffusion Model for High-Quality PET Image ReconstructionComments: Early accepted by MICCAI2024Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Radiation hazards associated with standard-dose positron emission tomography (SPET) images remain a concern, whereas the quality of low-dose PET (LPET) images fails to meet clinical requirements. Therefore, there is great interest in reconstructing SPET images from LPET images. However, prior studies focus solely on image data, neglecting vital complementary information from other modalities, e.g., patients' clinical tabular, resulting in compromised reconstruction with limited diagnostic utility. Moreover, they often overlook the semantic consistency between real SPET and reconstructed images, leading to distorted semantic contexts. To tackle these problems, we propose a novel Multi-modal Conditioned Adversarial Diffusion model (MCAD) to reconstruct SPET images from multi-modal inputs, including LPET images and clinical tabular. Specifically, our MCAD incorporates a Multi-modal conditional Encoder (Mc-Encoder) to extract multi-modal features, followed by a conditional diffusion process to blend noise with multi-modal features and gradually map blended features to the target SPET images. To balance multi-modal inputs, the Mc-Encoder embeds Optimal Multi-modal Transport co-Attention (OMTA) to narrow the heterogeneity gap between image and tabular while capturing their interactions, providing sufficient guidance for reconstruction. In addition, to mitigate semantic distortions, we introduce the Multi-Modal Masked Text Reconstruction (M3TRec), which leverages semantic knowledge extracted from denoised PET images to restore the masked clinical tabular, thereby compelling the network to maintain accurate semantics during reconstruction. To expedite the diffusion process, we further introduce an adversarial diffusive network with a reduced number of diffusion steps. Experiments show that our method achieves the state-of-the-art performance both qualitatively and quantitatively.
- [817] arXiv:2406.13151 (cross-list from stat.ML) [pdf, html, other]
-
Title: von Mises Quasi-Processes for Bayesian Circular RegressionComments: Contribution to the Structured Probabilistic Inference & Generative Modeling workshop of ICML 2024Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Computation (stat.CO)
The need for regression models to predict circular values arises in many scientific fields. In this work we explore a family of expressive and interpretable distributions over circle-valued random functions related to Gaussian processes targeting two Euclidean dimensions conditioned on the unit circle. The resulting probability model has connections with continuous spin models in statistical physics. Moreover, its density is very simple and has maximum-entropy, unlike previous Gaussian process-based approaches, which use wrapping or radial marginalization. For posterior inference, we introduce a new Stratonovich-like augmentation that lends itself to fast Markov Chain Monte Carlo sampling. We argue that transductive learning in these models favors a Bayesian approach to the parameters. We present experiments applying this model to the prediction of (i) wind directions and (ii) the percentage of the running gait cycle as a function of joint angles.
- [818] arXiv:2406.13154 (cross-list from stat.ML) [pdf, html, other]
-
Title: Conditional score-based diffusion models for solving inverse problems in mechanicsAgnimitra Dasgupta, Harisankar Ramaswamy, Javier Murgoitio Esandi, Ken Foo, Runze Li, Qifa Zhou, Brendan Kennedy, Assad OberaiSubjects: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
We propose a framework to perform Bayesian inference using conditional score-based diffusion models to solve a class of inverse problems in mechanics involving the inference of a specimen's spatially varying material properties from noisy measurements of its mechanical response to loading. Conditional score-based diffusion models are generative models that learn to approximate the score function of a conditional distribution using samples from the joint distribution. More specifically, the score functions corresponding to multiple realizations of the measurement are approximated using a single neural network, the so-called score network, which is subsequently used to sample the posterior distribution using an appropriate Markov chain Monte Carlo scheme based on Langevin dynamics. Training the score network only requires simulating the forward model. Hence, the proposed approach can accommodate black-box forward models and complex measurement noise. Moreover, once the score network has been trained, it can be re-used to solve the inverse problem for different realizations of the measurements. We demonstrate the efficacy of the proposed approach on a suite of high-dimensional inverse problems in mechanics that involve inferring heterogeneous material properties from noisy measurements. Some examples we consider involve synthetic data, while others include data collected from actual elastography experiments. Further, our applications demonstrate that the proposed approach can handle different measurement modalities, complex patterns in the inferred quantities, non-Gaussian and non-additive noise models, and nonlinear black-box forward models. The results show that the proposed framework can solve large-scale physics-based inverse problems efficiently.
- [819] arXiv:2406.13163 (cross-list from cond-mat.mtrl-sci) [pdf, html, other]
-
Title: LLMatDesign: Autonomous Materials Discovery with Large Language ModelsSubjects: Materials Science (cond-mat.mtrl-sci); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Discovering new materials can have significant scientific and technological implications but remains a challenging problem today due to the enormity of the chemical space. Recent advances in machine learning have enabled data-driven methods to rapidly screen or generate promising materials, but these methods still depend heavily on very large quantities of training data and often lack the flexibility and chemical understanding often desired in materials discovery. We introduce LLMatDesign, a novel language-based framework for interpretable materials design powered by large language models (LLMs). LLMatDesign utilizes LLM agents to translate human instructions, apply modifications to materials, and evaluate outcomes using provided tools. By incorporating self-reflection on its previous decisions, LLMatDesign adapts rapidly to new tasks and conditions in a zero-shot manner. A systematic evaluation of LLMatDesign on several materials design tasks, in silico, validates LLMatDesign's effectiveness in developing new materials with user-defined target properties in the small data regime. Our framework demonstrates the remarkable potential of autonomous LLM-guided materials discovery in the computational setting and towards self-driving laboratories in the future.
- [820] arXiv:2406.13165 (cross-list from eess.IV) [pdf, html, other]
-
Title: Cardiac Copilot: Automatic Probe Guidance for Echocardiography with World ModelComments: Early Accepted by MICCAI 2024Subjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Echocardiography is the only technique capable of real-time imaging of the heart and is vital for diagnosing the majority of cardiac diseases. However, there is a severe shortage of experienced cardiac sonographers, due to the heart's complex structure and significant operational challenges. To mitigate this situation, we present a Cardiac Copilot system capable of providing real-time probe movement guidance to assist less experienced sonographers in conducting freehand echocardiography. This system can enable non-experts, especially in primary departments and medically underserved areas, to perform cardiac ultrasound examinations, potentially improving global healthcare delivery. The core innovation lies in proposing a data-driven world model, named Cardiac Dreamer, for representing cardiac spatial structures. This world model can provide structure features of any cardiac planes around the current probe position in the latent space, serving as an precise navigation map for autonomous plane localization. We train our model with real-world ultrasound data and corresponding probe motion from 110 routine clinical scans with 151K sample pairs by three certified sonographers. Evaluations on three standard planes with 37K sample pairs demonstrate that the world model can reduce navigation errors by up to 33\% and exhibit more stable performance.
- [821] arXiv:2406.13205 (cross-list from eess.IV) [pdf, other]
-
Title: Application of Computer Deep Learning Model in Diagnosis of Pulmonary NodulesYutian Yang (1), Hongjie Qiu (2), Yulu Gong (3), Xiaoyi Liu (4), Yang Lin (5), Muqing Li (6) ((1) University of California, Davis, (2) University of Washington, (3) Northern Arizona University, (4) Arizona State University, (5) University of Pennsylvania, (6) University of California San Diego)Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
The 3D simulation model of the lung was established by using the reconstruction method. A computer aided pulmonary nodule detection model was constructed. The process iterates over the images to refine the lung nodule recognition model based on neural networks. It is integrated with 3D virtual modeling technology to improve the interactivity of the system, so as to achieve intelligent recognition of lung nodules. A 3D RCNN (Region-based Convolutional Neural Network) was utilized for feature extraction and nodule identification. The LUNA16 large sample database was used as the research dataset. FROC (Free-response Receiver Operating Characteristic) analysis was applied to evaluate the model, calculating sensitivity at various false positive rates to derive the average FROC. Compared with conventional diagnostic methods, the recognition rate was significantly improved. This technique facilitates the detection of pulmonary abnormalities at an initial phase, which holds immense value for the prompt diagnosis of lung malignancies.
- [822] arXiv:2406.13209 (cross-list from eess.IV) [pdf, html, other]
-
Title: Diffusion Model-based FOD Restoration from High Distortion in dMRIComments: 11 pages, 7 figuresSubjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Medical Physics (physics.med-ph)
Fiber orientation distributions (FODs) is a popular model to represent the diffusion MRI (dMRI) data. However, imaging artifacts such as susceptibility-induced distortion in dMRI can cause signal loss and lead to the corrupted reconstruction of FODs, which prohibits successful fiber tracking and connectivity analysis in affected brain regions such as the brain stem. Generative models, such as the diffusion models, have been successfully applied in various image restoration tasks. However, their application on FOD images poses unique challenges since FODs are 4-dimensional data represented by spherical harmonics (SPHARM) with the 4-th dimension exhibiting order-related dependency. In this paper, we propose a novel diffusion model for FOD restoration that can recover the signal loss caused by distortion artifacts. We use volume-order encoding to enhance the ability of the diffusion model to generate individual FOD volumes at all SPHARM orders. Moreover, we add cross-attention features extracted across all SPHARM orders in generating every individual FOD volume to capture the order-related dependency across FOD volumes. We also condition the diffusion model with low-distortion FODs surrounding high-distortion areas to maintain the geometric coherence of the generated FODs. We trained and tested our model using data from the UK Biobank (n = 1315). On a test set with ground truth (n = 43), we demonstrate the high accuracy of the generated FODs in terms of root mean square errors of FOD volumes and angular errors of FOD peaks. We also apply our method to a test set with large distortion in the brain stem area (n = 1172) and demonstrate the efficacy of our method in restoring the FOD integrity and, hence, greatly improving tractography performance in affected brain regions.
- [823] arXiv:2406.13268 (cross-list from eess.AS) [pdf, html, other]
-
Title: CEC: A Noisy Label Detection Method for Speaker RecognitionComments: interspeech 2024Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
Noisy labels are inevitable, even in well-annotated datasets. The detection of noisy labels is of significant importance to enhance the robustness of speaker recognition models. In this paper, we propose a novel noisy label detection approach based on two new statistical metrics: Continuous Inconsistent Counting (CIC) and Total Inconsistent Counting (TIC). These metrics are calculated through Cross-Epoch Counting (CEC) and correspond to the early and late stages of training, respectively. Additionally, we categorize samples based on their prediction results into three categories: inconsistent samples, hard samples, and easy samples. During training, we gradually increase the difficulty of hard samples to update model parameters, preventing noisy labels from being overfitted. Compared to contrastive schemes, our approach not only achieves the best performance in speaker verification but also excels in noisy label detection.
- [824] arXiv:2406.13292 (cross-list from q-bio.QM) [pdf, html, other]
-
Title: An interpretable generative multimodal neuroimaging-genomics framework for decoding Alzheimer's diseaseGiorgio Dolci (1,2), Federica Cruciani (1), Md Abdur Rahaman (2), Anees Abrol (2), Jiayu Chen (2), Zening Fu (2), Ilaria Boscolo Galazzo (1), Gloria Menegaz (1), Vince D. Calhoun (2) ((1) Department of Engineering for Innovation Medicine, University of Verona, Verona, Italy, (2) Tri-Institutional Center for Translational Research in Neuroimaging and Data Science (TReNDS), Georgia State University, Georgia Institute of Technology, Emory University, Atlanta, GA, USA)Comments: 27 pages, 7 figures, submitted to a journalSubjects: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
Alzheimer's disease (AD) is the most prevalent form of dementia with a progressive decline in cognitive abilities. The AD continuum encompasses a prodormal stage known as Mild Cognitive Impairment (MCI), where patients may either progress to AD or remain stable. In this study, we leveraged structural and functional MRI to investigate the disease-induced grey matter and functional network connectivity changes. Moreover, considering AD's strong genetic component, we introduce SNPs as a third channel. Given such diverse inputs, missing one or more modalities is a typical concern of multimodal methods. We hence propose a novel deep learning-based classification framework where generative module employing Cycle GANs was adopted to impute missing data within the latent space. Additionally, we adopted an Explainable AI method, Integrated Gradients, to extract input features relevance, enhancing our understanding of the learned representations. Two critical tasks were addressed: AD detection and MCI conversion prediction. Experimental results showed that our model was able to reach the SOA in the classification of CN/AD reaching an average test accuracy of $0.926\pm0.02$. For the MCI task, we achieved an average prediction accuracy of $0.711\pm0.01$ using the pre-trained model for CN/AD. The interpretability analysis revealed significant grey matter modulations in cortical and subcortical brain areas well known for their association with AD. Moreover, impairments in sensory-motor and visual resting state network connectivity along the disease continuum, as well as mutations in SNPs defining biological processes linked to amyloid-beta and cholesterol formation clearance and regulation, were identified as contributors to the achieved performance. Overall, our integrative deep learning approach shows promise for AD detection and MCI prediction, while shading light on important biological insights.
- [825] arXiv:2406.13312 (cross-list from eess.AS) [pdf, html, other]
-
Title: Pushing the Limit of Sound Event Detection with Multi-Dilated Frequency Dynamic ConvolutionSubjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
Frequency dynamic convolution (FDY conv) has been a milestone in the sound event detection (SED) field, but it involves a substantial increase in model size due to multiple basis kernels. In this work, we propose partial frequency dynamic convolution (PFD conv), which concatenates static conventional 2D convolution branch output and dynamic FDY conv branch output in order to minimize model size increase while maintaining the performance. Additionally, we propose multi-dilated frequency dynamic convolution (MDFD conv), which integrates multiple dilated frequency dynamic convolution (DFD conv) branches with different dilation size sets and a static branch within a single convolution module, achieving a 3.2% improvement in polyphonic sound detection score (PSDS) over FDY conv. Proposed methods with extensive ablation studies further enhance understanding and usability of FDY conv variants.
- [826] arXiv:2406.13321 (cross-list from math.CO) [pdf, html, other]
-
Title: On dual-ABAB-free and related hypergraphsSubjects: Combinatorics (math.CO); Computational Geometry (cs.CG)
Geometric motivations warranted the study of hypergraphs on ordered vertices that have no pair of hyperedges that induce an alternation of some given length. Such hypergraphs are called ABA-free, ABAB-free and so on. Since then various coloring and other combinatorial results were proved about these families of hypergraphs. We prove a characterization in terms of their incidence matrices which avoids using the ordering of the vertices. Using this characterization, we prove new results about the dual hypergraphs of ABAB-free hypergraphs. In particular, we show that dual-ABAB-free hypergraphs are not always proper $2$-colorable even if we restrict ourselves to hyperedges that are larger than some parameter $m$.
- [827] arXiv:2406.13337 (cross-list from eess.AS) [pdf, html, other]
-
Title: Medical Spoken Named Entity RecognitionComments: Preprint, 40 pagesSubjects: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD)
Spoken Named Entity Recognition (NER) aims to extracting named entities from speech and categorizing them into types like person, location, organization, etc. In this work, we present VietMed-NER - the first spoken NER dataset in the medical domain. To our best knowledge, our real-world dataset is the largest spoken NER dataset in the world in terms of the number of entity types, featuring 18 distinct types. Secondly, we present baseline results using various state-of-the-art pre-trained models: encoder-only and sequence-to-sequence. We found that pre-trained multilingual models XLM-R outperformed all monolingual models on both reference text and ASR output. Also in general, encoders perform better than sequence-to-sequence models for the NER task. By simply translating, the transcript is applicable not just to Vietnamese but to other languages as well. All code, data and models are made publicly available here: this https URL
- [828] arXiv:2406.13370 (cross-list from math.PR) [pdf, other]
-
Title: Computing the invariant distribution of McKean-Vlasov SDEs by ergodic simulationSubjects: Probability (math.PR); Numerical Analysis (math.NA)
We design a fully implementable scheme to compute the invariant distribution of ergodic McKean-Vlasov SDE satisfying a uniform confluence property. Under natural conditions, we prove various convergence results notably we obtain rates for the Wasserstein distance in quadratic mean and almost sure sense.
- [829] arXiv:2406.13379 (cross-list from math.OC) [pdf, html, other]
-
Title: Numerical Methods for Shape Optimal Design of Fluid-Structure Interaction ProblemsSubjects: Optimization and Control (math.OC); Numerical Analysis (math.NA)
We consider the method of mappings for performing shape optimization for unsteady fluid-structure interaction (FSI) problems. In this work, we focus on the numerical implementation. We model the optimization problem such that it takes several theoretical results into account, such as regularity requirements on the transformations and a differential geometrical point of view on the manifold of shapes. Moreover, we discretize the problem such that we can compute exact discrete gradients. This allows for the use of general purpose optimization solvers. We focus on an FSI benchmark problem to validate our numerical implementation. The method is used to optimize parts of the outer boundary and the interface. The numerical simulations build on FEniCS, dolfin-adjoint and IPOPT. Moreover, as an additional theoretical result, we show that for a linear special case the adjoint attains the same structure as the forward problem but reverses the temporal flow of information.
- [830] arXiv:2406.13383 (cross-list from nlin.CG) [pdf, html, other]
-
Title: Emergent Dynamics in Heterogeneous Life-Like Cellular AutomataAarati Shrestha, Felix Reimers, Sanyam Jain, Paolo Baldini, Michele Braccini, Andrea Roli, Stefano NicheleComments: 16 pages, 9 FiguresSubjects: Cellular Automata and Lattice Gases (nlin.CG); Emerging Technologies (cs.ET)
The Game of Life (GoL), one well known 2D cellular automaton, does not typically ensure interesting long-term phenotypic dynamics. Therefore, while being Turing complete, GoL cannot be said to be open-ended. In this work, we extend GoL with the opportunity for local mutations, thus enabling a heterogeneous life-like cellular automaton guided by an evolutionary inner loop. Additionally, we introduce the concept of cell ageing to ensure that cell aliveness (activated by inheritance with variation, and controlled by ageing) and actual cell computation (governed by life-like rules on local neighborhoods) are kept conceptually separated. We conduct an experimental campaign to identify suitable parameters that produce long-term phenotypic dynamics and favor genotypic innovations.
- [831] arXiv:2406.13385 (cross-list from eess.AS) [pdf, html, other]
-
Title: Explainable by-design Audio Segmentation through Non-Negative Matrix Factorization and ProbingComments: Accepted at Interspeech 2024, 5 pages, 2 figures, 3 tablesSubjects: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Sound (cs.SD)
Audio segmentation is a key task for many speech technologies, most of which are based on neural networks, usually considered as black boxes, with high-level performances. However, in many domains, among which health or forensics, there is not only a need for good performance but also for explanations about the output decision. Explanations derived directly from latent representations need to satisfy "good" properties, such as informativeness, compactness, or modularity, to be interpretable. In this article, we propose an explainable-by-design audio segmentation model based on non-negative matrix factorization (NMF) which is a good candidate for the design of interpretable representations. This paper shows that our model reaches good segmentation performances, and presents deep analyses of the latent representation extracted from the non-negative matrix. The proposed approach opens new perspectives toward the evaluation of interpretable representations according to "good" properties.
- [832] arXiv:2406.13386 (cross-list from eess.AS) [pdf, html, other]
-
Title: Online Domain-Incremental Learning Approach to Classify Acoustic Scenes in All LocationsComments: Accepted to EUSIPCO 2024Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
In this paper, we propose a method for online domain-incremental learning of acoustic scene classification from a sequence of different locations. Simply training a deep learning model on a sequence of different locations leads to forgetting of previously learned knowledge. In this work, we only correct the statistics of the Batch Normalization layers of a model using a few samples to learn the acoustic scenes from a new location without any excessive training. Experiments are performed on acoustic scenes from 11 different locations, with an initial task containing acoustic scenes from 6 locations and the remaining 5 incremental tasks each representing the acoustic scenes from a different location. The proposed approach outperforms fine-tuning based methods and achieves an average accuracy of 48.8% after learning the last task in sequence without forgetting acoustic scenes from the previously learned locations.
- [833] arXiv:2406.13389 (cross-list from cond-mat.soft) [pdf, html, other]
-
Title: Unifying Mixed Gas Adsorption in Molecular Sieve Membranes and MOFs using Machine LearningComments: Accepted in Separation and Purification Technology, on June 16, 2024. Data available at this https URLSubjects: Soft Condensed Matter (cond-mat.soft); Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
Recent machine learning models to accurately obtain gas adsorption isotherms focus on polymers or metal-organic frameworks (MOFs) separately. The difficulty in creating a unified model that can predict the adsorption trends in both types of adsorbents is challenging, owing to the diversity in their chemical structures. Moreover, models trained only on single gas adsorption data are incapable of predicting adsorption isotherms for binary gas mixtures. In this work, we address these problems using feature vectors comprising only the physical properties of the gas mixtures and adsorbents. Our model is trained on adsorption isotherms of both single and binary mixed gases inside carbon molecular sieving membrane (CMSM), together with data available from CoRE MOF database. The trained models are capable of accurately predicting the adsorption trends in both classes of materials, for both pure and binary components. ML architecture designed for one class of material, is not suitable for predicting the other class, even after proper training, signifying that the model must be trained jointly for proper predictions and transferability. The model is used to predict with good accuracy the CO2 uptake inside CALF-20 framework. This work opens up a new avenue for predicting complex adsorption processes for gas mixtures in a wide range of materials.
- [834] arXiv:2406.13405 (cross-list from quant-ph) [pdf, html, other]
-
Title: Gate Teleportation in Noisy Quantum Networks with the SquidASM SimulatorComments: 10 pages, 20 figuresSubjects: Quantum Physics (quant-ph); Emerging Technologies (cs.ET); Networking and Internet Architecture (cs.NI)
We implement the gate teleportation algorithm for teleporting arbitrary two-qubit Clifford gates and the Toffoli gate within the context of multi-node quantum networks, utilizing the SquidASM quantum network simulator. We show how a gate teleportation scheme can be used to implement gate cutting, which is an important approach to realize large circuits in distributed quantum computing environments. The correction operations in teleportation are automatically constructed for arbitrary two-qubit Clifford gates. We present simulation results for CNOT, DCNOT, CZ, SWAP, and Toffoli gates. For the Toffoli gate, we apply a similar gate teleportation protocol with the difference that the correction operation becomes more complex since the gate is non-Clifford. We perform the simulations under varying conditions of quantum channel and device noise levels. The simulations provide valuable insights into the robustness and efficacy of the implemented algorithms, and they assist in identifying the critical components within quantum networks where noise primarily affects the execution of applications.
- [835] arXiv:2406.13412 (cross-list from quant-ph) [pdf, html, other]
-
Title: Tensor Decompositions and Adiabatic Quantum Computing for Discovering Practical Matrix Multiplication AlgorithmsComments: 12 pages, 6 figuresSubjects: Quantum Physics (quant-ph); Data Structures and Algorithms (cs.DS)
Quantum computing and modern tensor-based computing have a strong connection, which is especially demonstrated by simulating quantum computations with tensor networks. The other direction is less studied: quantum computing is not often applied to tensor-based problems. Considering tensor decompositions, we focus on discovering practical matrix multiplication algorithms and develop two algorithms to compute decompositions on quantum computers. The algorithms are expressed as higher-order unconstrained binary optimization (HUBO) problems, which are translated into quadratic unconstrained binary optimization (QUBO) problems. Our first algorithm is decompositional to keep the optimization problem feasible for the current quantum devices. Starting from a suitable initial point, the algorithm discovers tensor decomposition corresponding to the famous Strassen matrix multiplication algorithm, utilizing the current quantum annealers. Since the decompositional algorithm does not guarantee minimal length for found tensor decompositions, we develop a holistic algorithm that can find fixed-length decompositions. Theoretically, by fixing a shorter length than the length for the best-known decomposition, we can ensure that the solution to the holistic optimization problem would yield faster matrix multiplication algorithms.
- [836] arXiv:2406.13413 (cross-list from eess.IV) [pdf, html, other]
-
Title: Recurrent Inference Machine for Medical Image RegistrationComments: PreprintSubjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Image registration is essential for medical image applications where alignment of voxels across multiple images is needed for qualitative or quantitative analysis. With recent advancements in deep neural networks and parallel computing, deep learning-based medical image registration methods become competitive with their flexible modelling and fast inference capabilities. However, compared to traditional optimization-based registration methods, the speed advantage may come at the cost of registration performance at inference time. Besides, deep neural networks ideally demand large training datasets while optimization-based methods are training-free. To improve registration accuracy and data efficiency, we propose a novel image registration method, termed Recurrent Inference Image Registration (RIIR) network. RIIR is formulated as a meta-learning solver to the registration problem in an iterative manner. RIIR addresses the accuracy and data efficiency issues, by learning the update rule of optimization, with implicit regularization combined with explicit gradient input.
We evaluated RIIR extensively on brain MRI and quantitative cardiac MRI datasets, in terms of both registration accuracy and training data efficiency. Our experiments showed that RIIR outperformed a range of deep learning-based methods, even with only $5\%$ of the training data, demonstrating high data efficiency. Key findings from our ablation studies highlighted the important added value of the hidden states introduced in the recurrent inference framework for meta-learning. Our proposed RIIR offers a highly data-efficient framework for deep learning-based medical image registration. - [837] arXiv:2406.13425 (cross-list from stat.ML) [pdf, html, other]
-
Title: Coupled Input-Output Dimension Reduction: Application to Goal-oriented Bayesian Experimental Design and Global Sensitivity AnalysisSubjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)
We introduce a new method to jointly reduce the dimension of the input and output space of a high-dimensional function. Choosing a reduced input subspace influences which output subspace is relevant and vice versa. Conventional methods focus on reducing either the input or output space, even though both are often reduced simultaneously in practice. Our coupled approach naturally supports goal-oriented dimension reduction, where either an input or output quantity of interest is prescribed. We consider, in particular, goal-oriented sensor placement and goal-oriented sensitivity analysis, which can be viewed as dimension reduction where the most important output or, respectively, input components are chosen. Both applications present difficult combinatorial optimization problems with expensive objectives such as the expected information gain and Sobol indices. By optimizing gradient-based bounds, we can determine the most informative sensors and most sensitive parameters as the largest diagonal entries of some diagnostic matrices, thus bypassing the combinatorial optimization and objective evaluation.
- [838] arXiv:2406.13441 (cross-list from eess.IV) [pdf, html, other]
-
Title: Robust Melanoma Thickness Prediction via Deep Transfer Learning enhanced by XAI TechniquesSubjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
This study focuses on analyzing dermoscopy images to determine the depth of melanomas, which is a critical factor in diagnosing and treating skin cancer. The Breslow depth, measured from the top of the granular layer to the deepest point of tumor invasion, serves as a crucial parameter for staging melanoma and guiding treatment decisions. This research aims to improve the prediction of the depth of melanoma through the use of machine learning models, specifically deep learning, while also providing an analysis of the possible existance of graduation in the images characteristics which correlates with the depth of the melanomas. Various datasets, including ISIC and private collections, were used, comprising a total of 1162 images. The datasets were combined and balanced to ensure robust model training. The study utilized pre-trained Convolutional Neural Networks (CNNs). Results indicated that the models achieved significant improvements over previous methods. Additionally, the study conducted a correlation analysis between model's predictions and actual melanoma thickness, revealing a moderate correlation that improves with higher thickness values. Explainability methods such as feature visualization through Principal Component Analysis (PCA) demonstrated the capability of deep features to distinguish between different depths of melanoma, providing insight into the data distribution and model behavior. In summary, this research presents a dual contribution: enhancing the state-of-the-art classification results through advanced training techniques and offering a detailed analysis of the data and model behavior to better understand the relationship between dermoscopy images and melanoma thickness.
- [839] arXiv:2406.13447 (cross-list from math.ST) [pdf, other]
-
Title: High-probability minimax lower boundsComments: 37 pages, 3 figuresSubjects: Statistics Theory (math.ST); Information Theory (cs.IT); Machine Learning (cs.LG); Machine Learning (stat.ML)
The minimax risk is often considered as a gold standard against which we can compare specific statistical procedures. Nevertheless, as has been observed recently in robust and heavy-tailed estimation problems, the inherent reduction of the (random) loss to its expectation may entail a significant loss of information regarding its tail behaviour. In an attempt to avoid such a loss, we introduce the notion of a minimax quantile, and seek to articulate its dependence on the quantile level. To this end, we develop high-probability variants of the classical Le Cam and Fano methods, as well as a technique to convert local minimax risk lower bounds to lower bounds on minimax quantiles. To illustrate the power of our framework, we deploy our techniques on several examples, recovering recent results in robust mean estimation and stochastic convex optimisation, as well as obtaining several new results in covariance matrix estimation, sparse linear regression, nonparametric density estimation and isotonic regression. Our overall goal is to argue that minimax quantiles can provide a finer-grained understanding of the difficulty of statistical problems, and that, in wide generality, lower bounds on these quantities can be obtained via user-friendly tools.
- [840] arXiv:2406.13452 (cross-list from quant-ph) [pdf, html, other]
-
Title: Quantum Networks: from Multipartite Entanglement to Hypergraph ImmersionComments: 8 pages, 3 figures, 1 tableSubjects: Quantum Physics (quant-ph); Discrete Mathematics (cs.DM); Social and Information Networks (cs.SI); Computational Physics (physics.comp-ph); Physics and Society (physics.soc-ph)
Multipartite entanglement, a higher-order interaction unique to quantum information, offers various advantages over bipartite entanglement in quantum network (QN) applications. Establishing multipartite entanglement across remote parties in QN requires entanglement routing, which irreversibly transforms the QN topology at the cost of existing entanglement links. Here, we address the question of whether a QN can be topologically transformed into another via entanglement routing. Our key result is an exact mapping from multipartite entanglement routing to Nash-Williams's graph immersion problem, extended to hypergraphs. This generalized hypergraph immersion problem introduces a partial order between QN topologies, permitting certain topological transformations while precluding others, offering discerning insights into the design and manipulation of higher-order network topologies in QNs.
- [841] arXiv:2406.13470 (cross-list from eess.SP) [pdf, html, other]
-
Title: Automatic Voice Classification Of Autistic SubjectsComments: Accepted for publication at EAI BODYNETS 2023 2024 - 18th EAI International Conference on Body Area Networks: Intelligent Edge Cloud for Dependable Globally Connected BAN, February 5-6, 2024 Milan, ItalySubjects: Signal Processing (eess.SP); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Autism Spectrum Disorders (ASD) describe a heterogeneous set of conditions classified as neurodevelopmental disorders. Although the mechanisms underlying ASD are not yet fully understood, more recent literature focused on multiple genetics and/or environmental risk factors. Heterogeneity of symptoms, especially in milder forms of this condition, could be a challenge for the clinician. In this work, an automatic speech classification algorithm is proposed to characterize the prosodic elements that best distinguish autism, to support the traditional diagnosis. The performance of the proposed algorithm is evaluted by testing the classification algorithms on a dataset composed of recorded speeches, collected among both autustic and non autistic subjects.
- [842] arXiv:2406.13471 (cross-list from eess.AS) [pdf, html, other]
-
Title: Diffusion-based Generative Modeling with Discriminative Guidance for Streamable Speech EnhancementSubjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
Diffusion-based generative models (DGMs) have recently attracted attention in speech enhancement research (SE) as previous works showed a remarkable generalization capability. However, DGMs are also computationally intensive, as they usually require many iterations in the reverse diffusion process (RDP), making them impractical for streaming SE systems. In this paper, we propose to use discriminative scores from discriminative models in the first steps of the RDP. These discriminative scores require only one forward pass with the discriminative model for multiple RDP steps, thus greatly reducing computations. This approach also allows for performance improvements. We show that we can trade off between generative and discriminative capabilities as the number of steps with the discriminative score increases. Furthermore, we propose a novel streamable time-domain generative model with an algorithmic latency of 50 ms, which has no significant performance degradation compared to offline models.
- [843] arXiv:2406.13486 (cross-list from q-fin.MF) [pdf, html, other]
-
Title: Mean-Variance Portfolio Selection in Long-Term Investments with Unknown Distribution: Online Estimation, Risk Aversion under Ambiguity, and Universality of AlgorithmsComments: 21 pages, working paper, first draft version (may contain errors)Subjects: Mathematical Finance (q-fin.MF); Machine Learning (cs.LG); Probability (math.PR); Portfolio Management (q-fin.PM)
The standard approach for constructing a Mean-Variance portfolio involves estimating parameters for the model using collected samples. However, since the distribution of future data may not resemble that of the training set, the out-of-sample performance of the estimated portfolio is worse than one derived with true parameters, which has prompted several innovations for better estimation. Instead of treating the data without a timing aspect as in the common training-backtest approach, this paper adopts a perspective where data gradually and continuously reveal over time. The original model is recast into an online learning framework, which is free from any statistical assumptions, to propose a dynamic strategy of sequential portfolios such that its empirical utility, Sharpe ratio, and growth rate asymptotically achieve those of the true portfolio, derived with perfect knowledge of the future data.
When the distribution of future data has a normal shape, the growth rate of wealth is shown to increase by lifting the portfolio along the efficient frontier through the calibration of risk aversion. Since risk aversion cannot be appropriately predetermined, another proposed algorithm updating this coefficient over time forms a dynamic strategy approaching the optimal empirical Sharpe ratio or growth rate associated with the true coefficient. The performance of these proposed strategies is universally guaranteed under specific stochastic markets. Furthermore, in stationary and ergodic markets, the so-called Bayesian strategy utilizing true conditional distributions, based on observed past market information during investment, almost surely does not perform better than the proposed strategies in terms of empirical utility, Sharpe ratio, or growth rate, which, in contrast, do not rely on conditional distributions. - [844] arXiv:2406.13488 (cross-list from stat.ML) [pdf, html, other]
-
Title: Approximately Equivariant Neural ProcessesSubjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Equivariant deep learning architectures exploit symmetries in learning problems to improve the sample efficiency of neural-network-based models and their ability to generalise. However, when modelling real-world data, learning problems are often not exactly equivariant, but only approximately. For example, when estimating the global temperature field from weather station observations, local topographical features like mountains break translation equivariance. In these scenarios, it is desirable to construct architectures that can flexibly depart from exact equivariance in a data-driven way. In this paper, we develop a general approach to achieving this using existing equivariant architectures. Our approach is agnostic to both the choice of symmetry group and model architecture, making it widely applicable. We consider the use of approximately equivariant architectures in neural processes (NPs), a popular family of meta-learning models. We demonstrate the effectiveness of our approach on a number of synthetic and real-world regression experiments, demonstrating that approximately equivariant NP models can outperform both their non-equivariant and strictly equivariant counterparts.
- [845] arXiv:2406.13523 (cross-list from physics.app-ph) [pdf, html, other]
-
Title: Measurement of the Crystallization and Phase Transition of Niobium Dioxide Thin-Films for Neuromorphic Computing Applications Using a Tube Furnace Optical Transmission SystemZachary R. Robinson, Karsten Beckmann, James Michels, Vincent Daviero, Elizabeth A. Street, Fiona Lorenzen, Matthew C. Sullivan, Nathaniel Cady, Alexander Kozen, Marc CurrieSubjects: Applied Physics (physics.app-ph); Materials Science (cond-mat.mtrl-sci); Emerging Technologies (cs.ET)
Significant research has focused on low-power stochastic devices built from memristive materials. These devices foster neuromorphic approaches to computational efficiency enhancement in merged biomimetic and CMOS architectures due to their ability to phase transition from a dielectric to a metal at an increased temperature. Niobium dioxide has a volatile memristive phase change that occurs $\sim$800$^\circ$C~that makes it an ideal candidate for future neuromorphic electronics. A straightforward optical system has been developed on a horizontal tube furnace for \emph{in situ} spectral measurements as an as-grown \NbtOf\ film is annealed and ultimately crystallizes as \NbOt. The system measures the changing spectral transmissivity of \NbtOf\ as it undergoes both reduction and crystallization processes. We were also able to measure the transition from metallic-to-non-metallic \NbOt\ during the cooldown phase, which is shown to occur about 100$^\circ$C~ lower on a sapphire substrate than fused silica. After annealing, the material properties of the \NbtOf\ and \NbOt\ were assessed via X-ray photoelectron spectroscopy, X-ray diffraction, and 4-point resistivity, confirming that we have made crystalline \NbOt.
- [846] arXiv:2406.13612 (cross-list from math.OC) [pdf, html, other]
-
Title: On Computation of Approximate Solutions to Large-Scale Backstepping Kernel Equations via Continuum ApproximationComments: 13 pages, 5 figures, submitted to Systems & Control LettersSubjects: Optimization and Control (math.OC); Systems and Control (eess.SY)
We provide two methods for computation of continuum backstepping kernels that arise in control of continua (ensembles) of linear hyperbolic PDEs and which can approximate backstepping kernels arising in control of a large-scale, PDE system counterpart (with computational complexity that does not grow with the number of state components of the large-scale system). In the first method, we identify a class of systems for which the solution to the continuum (and hence, also an approximate solution to the respective large-scale) kernel equations can be constructed in closed form. In the second method, we provide explicit formulae for the solution to the continuum kernels PDEs, employing a (triple) power series representation of the continuum kernel and establishing its convergence properties. In this case, we also provide means for reducing computational complexity by properly truncating the power series (in the powers of the ensemble variable). We also present numerical examples to illustrate computational efficiency/accuracy of the approaches, as well as to validate the stabilization properties of the approximate control kernels, constructed based on the continuum.
- [847] arXiv:2406.13619 (cross-list from stat.ML) [pdf, html, other]
-
Title: Generative Modeling by Minimizing the Wasserstein-2 LossSubjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
This paper approaches the unsupervised learning problem by minimizing the second-order Wasserstein loss (the $W_2$ loss). The minimization is characterized by a distribution-dependent ordinary differential equation (ODE), whose dynamics involves the Kantorovich potential between a current estimated distribution and the true data distribution. A main result shows that the time-marginal law of the ODE converges exponentially to the true data distribution. To prove that the ODE has a unique solution, we first construct explicitly a solution to the associated nonlinear Fokker-Planck equation and show that it coincides with the unique gradient flow for the $W_2$ loss. Based on this, a unique solution to the ODE is built from Trevisan's superposition principle and the exponential convergence results. An Euler scheme is proposed for the distribution-dependent ODE and it is shown to correctly recover the gradient flow for the $W_2$ loss in the limit. An algorithm is designed by following the scheme and applying persistent training, which is natural in our gradient-flow framework. In both low- and high-dimensional experiments, our algorithm converges much faster than and outperforms Wasserstein generative adversarial networks, by increasing the level of persistent training appropriately.
- [848] arXiv:2406.13645 (cross-list from eess.IV) [pdf, html, other]
-
Title: Advancing UWF-SLO Vessel Segmentation with Source-Free Active Domain Adaptation and a Novel Multi-Center DatasetComments: MICCAI 2024 Early AcceptSubjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Accurate vessel segmentation in Ultra-Wide-Field Scanning Laser Ophthalmoscopy (UWF-SLO) images is crucial for diagnosing retinal diseases. Although recent techniques have shown encouraging outcomes in vessel segmentation, models trained on one medical dataset often underperform on others due to domain shifts. Meanwhile, manually labeling high-resolution UWF-SLO images is an extremely challenging, time-consuming and expensive task. In response, this study introduces a pioneering framework that leverages a patch-based active domain adaptation approach. By actively recommending a few valuable image patches by the devised Cascade Uncertainty-Predominance (CUP) selection strategy for labeling and model-finetuning, our method significantly improves the accuracy of UWF-SLO vessel segmentation across diverse medical centers. In addition, we annotate and construct the first Multi-center UWF-SLO Vessel Segmentation (MU-VS) dataset to promote this topic research, comprising data from multiple institutions. This dataset serves as a valuable resource for cross-center evaluation, verifying the effectiveness and robustness of our approach. Experimental results demonstrate that our approach surpasses existing domain adaptation and active learning methods, considerably reducing the gap between the Upper and Lower bounds with minimal annotations, highlighting our method's practical clinical value. We will release our dataset and code to facilitate relevant research: this https URL.
- [849] arXiv:2406.13674 (cross-list from eess.IV) [pdf, html, other]
-
Title: Rethinking Abdominal Organ Segmentation (RAOS) in the clinical scenario: A robustness evaluation benchmark with challenging casesComments: 10 pages, 1 figure, 6 tables, Early Accept to MICCAI 2024Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Deep learning has enabled great strides in abdominal multi-organ segmentation, even surpassing junior oncologists on common cases or organs. However, robustness on corner cases and complex organs remains a challenging open problem for clinical adoption. To investigate model robustness, we collected and annotated the RAOS dataset comprising 413 CT scans ($\sim$80k 2D images, $\sim$8k 3D organ annotations) from 413 patients each with 17 (female) or 19 (male) labelled organs, manually delineated by oncologists. We grouped scans based on clinical information into 1) diagnosis/radiotherapy (317 volumes), 2) partial excision without the whole organ missing (22 volumes), and 3) excision with the whole organ missing (74 volumes). RAOS provides a potential benchmark for evaluating model robustness including organ hallucination. It also includes some organs that can be very hard to access on public datasets like the rectum, colon, intestine, prostate and seminal vesicles. We benchmarked several state-of-the-art methods in these three clinical groups to evaluate performance and robustness. We also assessed cross-generalization between RAOS and three public datasets. This dataset and comprehensive analysis establish a potential baseline for future robustness research: \url{this https URL}.
- [850] arXiv:2406.13705 (cross-list from eess.IV) [pdf, html, other]
-
Title: EndoUIC: Promptable Diffusion Transformer for Unified Illumination Correction in Capsule EndoscopyLong Bai, Qiaozhi Tan, Tong Chen, Wan Jun Nah, Yanheng Li, Zhicheng He, Sishen Yuan, Zhen Chen, Jinlin Wu, Mobarakol Islam, Zhen Li, Hongbin Liu, Hongliang RenComments: To appear in MICCAI 2024. Code and dataset availability: this https URLSubjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Wireless Capsule Endoscopy (WCE) is highly valued for its non-invasive and painless approach, though its effectiveness is compromised by uneven illumination from hardware constraints and complex internal dynamics, leading to overexposed or underexposed images. While researchers have discussed the challenges of low-light enhancement in WCE, the issue of correcting for different exposure levels remains underexplored. To tackle this, we introduce EndoUIC, a WCE unified illumination correction solution using an end-to-end promptable diffusion transformer (DFT) model. In our work, the illumination prompt module shall navigate the model to adapt to different exposure levels and perform targeted image enhancement, in which the Adaptive Prompt Integration (API) and Global Prompt Scanner (GPS) modules shall further boost the concurrent representation learning between the prompt parameters and features. Besides, the U-shaped restoration DFT model shall capture the long-range dependencies and contextual information for unified illumination restoration. Moreover, we present a novel Capsule-endoscopy Exposure Correction (CEC) dataset, including ground-truth and corrupted image pairs annotated by expert photographers. Extensive experiments against a variety of state-of-the-art (SOTA) methods on four datasets showcase the effectiveness of our proposed method and components in WCE illumination restoration, and the additional downstream experiments further demonstrate its utility for clinical diagnosis and surgical assistance.
- [851] arXiv:2406.13709 (cross-list from eess.IV) [pdf, html, other]
-
Title: A Study on the Effect of Color Spaces in Learned Image CompressionSrivatsa Prativadibhayankaram, Mahadev Prasad Panda, Jürgen Seiler, Thomas Richter, Heiko Sparenberg, Siegfried Fößel, André KaupComments: Accepter pre-print version for ICIP 2024Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
In this work, we present a comparison between color spaces namely YUV, LAB, RGB and their effect on learned image compression. For this we use the structure and color based learned image codec (SLIC) from our prior work, which consists of two branches - one for the luminance component (Y or L) and another for chrominance components (UV or AB). However, for the RGB variant we input all 3 channels in a single branch, similar to most learned image codecs operating in RGB. The models are trained for multiple bitrate configurations in each color space. We report the findings from our experiments by evaluating them on various datasets and compare the results to state-of-the-art image codecs. The YUV model performs better than the LAB variant in terms of MS-SSIM with a Bjøntegaard delta bitrate (BD-BR) gain of 7.5\% using VTM intra-coding mode as the baseline. Whereas the LAB variant has a better performance than YUV model in terms of CIEDE2000 having a BD-BR gain of 8\%. Overall, the RGB variant of SLIC achieves the best performance with a BD-BR gain of 13.14\% in terms of MS-SSIM and a gain of 17.96\% in CIEDE2000 at the cost of a higher model complexity.
- [852] arXiv:2406.13726 (cross-list from math.OC) [pdf, html, other]
-
Title: Global Solutions to Master Equations for Continuous Time Heterogeneous Agent Macroeconomic ModelsSubjects: Optimization and Control (math.OC); Machine Learning (cs.LG); General Economics (econ.GN)
We propose and compare new global solution algorithms for continuous time heterogeneous agent economies with aggregate shocks. First, we approximate the agent distribution so that equilibrium in the economy can be characterized by a high, but finite, dimensional non-linear partial differential equation. We consider different approximations: discretizing the number of agents, discretizing the agent state variables, and projecting the distribution onto a finite set of basis functions. Second, we represent the value function using a neural network and train it to solve the differential equation using deep learning tools. We refer to the solution as an Economic Model Informed Neural Network (EMINN). The main advantage of this technique is that it allows us to find global solutions to high dimensional, non-linear problems. We demonstrate our algorithm by solving important models in the macroeconomics and spatial literatures (e.g. Krusell and Smith (1998), Khan and Thomas (2007), Bilal (2023)).
- [853] arXiv:2406.13750 (cross-list from eess.IV) [pdf, html, other]
-
Title: Empowering Tuberculosis Screening with Explainable Self-Supervised Deep Neural NetworksComments: 9 pages, 3 figuresSubjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Tuberculosis persists as a global health crisis, especially in resource-limited populations and remote regions, with more than 10 million individuals newly infected annually. It stands as a stark symbol of inequity in public health. Tuberculosis impacts roughly a quarter of the global populace, with the majority of cases concentrated in eight countries, accounting for two-thirds of all tuberculosis infections. Although a severe ailment, tuberculosis is both curable and manageable. However, early detection and screening of at-risk populations are imperative. Chest x-ray stands as the predominant imaging technique utilized in tuberculosis screening efforts. However, x-ray screening necessitates skilled radiologists, a resource often scarce, particularly in remote regions with limited resources. Consequently, there is a pressing need for artificial intelligence (AI)-powered systems to support clinicians and healthcare providers in swift screening. However, training a reliable AI model necessitates large-scale high-quality data, which can be difficult and costly to acquire. Inspired by these challenges, in this work, we introduce an explainable self-supervised self-train learning network tailored for tuberculosis case screening. The network achieves an outstanding overall accuracy of 98.14% and demonstrates high recall and precision rates of 95.72% and 99.44%, respectively, in identifying tuberculosis cases, effectively capturing clinically significant features.
- [854] arXiv:2406.13783 (cross-list from econ.TH) [pdf, html, other]
-
Title: Nash equilibria of quasisupermodular gamesSubjects: Theoretical Economics (econ.TH); Computer Science and Game Theory (cs.GT)
We prove three results on the existence and structure of Nash equilibria for quasisupermodular games. A theorem is purely order-theoretic, and the other two involve topological hypotheses. Our topological results genralize Zhou's theorem (for supermodular games) and Calciano's theorem.
- [855] arXiv:2406.13815 (cross-list from eess.IV) [pdf, other]
-
Title: IG-CFAT: An Improved GAN-Based Framework for Effectively Exploiting Transformers in Real-World Image Super-ResolutionSubjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
In the field of single image super-resolution (SISR), transformer-based models, have demonstrated significant advancements. However, the potential and efficiency of these models in applied fields such as real-world image super-resolution are less noticed and there are substantial opportunities for improvement. Recently, composite fusion attention transformer (CFAT), outperformed previous state-of-the-art (SOTA) models in classic image super-resolution. This paper extends the CFAT model to an improved GAN-based model called IG-CFAT to effectively exploit the performance of transformers in real-world image super-resolution. IG-CFAT incorporates a semantic-aware discriminator to reconstruct image details more accurately, significantly improving perceptual quality. Moreover, our model utilizes an adaptive degradation model to better simulate real-world degradations. Our methodology adds wavelet losses to conventional loss functions of GAN-based super-resolution models to reconstruct high-frequency details more efficiently. Empirical results demonstrate that IG-CFAT sets new benchmarks in real-world image super-resolution, outperforming SOTA models in both quantitative and qualitative metrics.
- [856] arXiv:2406.13839 (cross-list from q-bio.BM) [pdf, html, other]
-
Title: RNA-FrameFlow: Flow Matching for de novo 3D RNA Backbone DesignRishabh Anand, Chaitanya K. Joshi, Alex Morehead, Arian R. Jamasb, Charles Harris, Simon V. Mathis, Kieran Didi, Bryan Hooi, Pietro LiòComments: To be presented as an Oral at ICML 2024 Structured Probabilistic Inference & Generative Modeling Workshop, and a Spotlight at ICML 2024 AI4Science WorkshopSubjects: Biomolecules (q-bio.BM); Machine Learning (cs.LG); Genomics (q-bio.GN)
We introduce RNA-FrameFlow, the first generative model for 3D RNA backbone design. We build upon SE(3) flow matching for protein backbone generation and establish protocols for data preparation and evaluation to address unique challenges posed by RNA modeling. We formulate RNA structures as a set of rigid-body frames and associated loss functions which account for larger, more conformationally flexible RNA backbones (13 atoms per nucleotide) vs. proteins (4 atoms per residue). Toward tackling the lack of diversity in 3D RNA datasets, we explore training with structural clustering and cropping augmentations. Additionally, we define a suite of evaluation metrics to measure whether the generated RNA structures are globally self-consistent (via inverse folding followed by forward folding) and locally recover RNA-specific structural descriptors. The most performant version of RNA-FrameFlow generates locally realistic RNA backbones of 40-150 nucleotides, over 40% of which pass our validity criteria as measured by a self-consistency TM-score >= 0.45, at which two RNAs have the same global fold. Open-source code: this https URL
- [857] arXiv:2406.13879 (cross-list from quant-ph) [pdf, html, other]
-
Title: A Catalyst Framework for the Quantum Linear System Problem via the Proximal Point AlgorithmSubjects: Quantum Physics (quant-ph); Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG); Optimization and Control (math.OC)
Solving systems of linear equations is a fundamental problem, but it can be computationally intensive for classical algorithms in high dimensions. Existing quantum algorithms can achieve exponential speedups for the quantum linear system problem (QLSP) in terms of the problem dimension, but even such a theoretical advantage is bottlenecked by the condition number of the coefficient matrix. In this work, we propose a new quantum algorithm for QLSP inspired by the classical proximal point algorithm (PPA). Our proposed method can be viewed as a meta-algorithm that allows inverting a modified matrix via an existing \texttt{QLSP\_solver}, thereby directly approximating the solution vector instead of approximating the inverse of the coefficient matrix. By carefully choosing the step size $\eta$, the proposed algorithm can effectively precondition the linear system to mitigate the dependence on condition numbers that hindered the applicability of previous approaches.
- [858] arXiv:2406.13888 (cross-list from math.OC) [pdf, html, other]
-
Title: Open Problem: Anytime Convergence Rate of Gradient DescentComments: COLT 2024 open problem; 5 pagesSubjects: Optimization and Control (math.OC); Machine Learning (cs.LG)
Recent results show that vanilla gradient descent can be accelerated for smooth convex objectives, merely by changing the stepsize sequence. We show that this can lead to surprisingly large errors indefinitely, and therefore ask: Is there any stepsize schedule for gradient descent that accelerates the classic $\mathcal{O}(1/T)$ convergence rate, at \emph{any} stopping time $T$?
- [859] arXiv:2406.13895 (cross-list from eess.IV) [pdf, html, other]
-
Title: INFusion: Diffusion Regularized Implicit Neural Representations for 2D and 3D accelerated MRI reconstructionComments: 6 pages, 4 figures, asilomar 2024 submissionSubjects: Image and Video Processing (eess.IV); Machine Learning (cs.LG)
Implicit Neural Representations (INRs) are a learning-based approach to accelerate Magnetic Resonance Imaging (MRI) acquisitions, particularly in scan-specific settings when only data from the under-sampled scan itself are available. Previous work demonstrates that INRs improve rapid MRI through inherent regularization imposed by neural network architectures. Typically parameterized by fully-connected neural networks, INRs support continuous image representations by taking a physical coordinate location as input and outputting the intensity at that coordinate. Previous work has applied unlearned regularization priors during INR training and have been limited to 2D or low-resolution 3D acquisitions. Meanwhile, diffusion based generative models have received recent attention as they learn powerful image priors decoupled from the measurement model. This work proposes INFusion, a technique that regularizes the optimization of INRs from under-sampled MR measurements with pre-trained diffusion models for improved image reconstruction. In addition, we propose a hybrid 3D approach with our diffusion regularization that enables INR application on large-scale 3D MR datasets. 2D experiments demonstrate improved INR training with our proposed diffusion regularization, and 3D experiments demonstrate feasibility of INR training with diffusion regularization on 3D matrix sizes of 256 by 256 by 80.
- [860] arXiv:2406.13902 (cross-list from math.CO) [pdf, html, other]
-
Title: Signed combinatorial interpretations in algebraic combinatoricsComments: 23 pagesSubjects: Combinatorics (math.CO); Discrete Mathematics (cs.DM)
We prove the existence of signed combinatorial interpretations for several large families of structure constants. These families include standard bases of symmetric and quasisymmetric polynomials, as well as various bases in Schubert theory. The results are stated in the language of computational complexity, while the proofs are based on the effective Möbius inversion.
- [861] arXiv:2406.13935 (cross-list from eess.AS) [pdf, html, other]
-
Title: CONMOD: Controllable Neural Frame-based Modulation EffectsSubjects: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Sound (cs.SD)
Deep learning models have seen widespread use in modelling LFO-driven audio effects, such as phaser and flanger. Although existing neural architectures exhibit high-quality emulation of individual effects, they do not possess the capability to manipulate the output via control parameters. To address this issue, we introduce Controllable Neural Frame-based Modulation Effects (CONMOD), a single black-box model which emulates various LFO-driven effects in a frame-wise manner, offering control over LFO frequency and feedback parameters. Additionally, the model is capable of learning the continuous embedding space of two distinct phaser effects, enabling us to steer between effects and achieve creative outputs. Our model outperforms previous work while possessing both controllability and universality, presenting opportunities to enhance creativity in modern LFO-driven audio effects.
- [862] arXiv:2406.13936 (cross-list from stat.ML) [pdf, html, other]
-
Title: Communication-Efficient Adaptive Batch Size Strategies for Distributed Local Gradient MethodsSubjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Optimization and Control (math.OC)
Modern deep neural networks often require distributed training with many workers due to their large size. As worker numbers increase, communication overheads become the main bottleneck in data-parallel minibatch stochastic gradient methods with per-iteration gradient synchronization. Local gradient methods like Local SGD reduce communication by only syncing after several local steps. Despite understanding their convergence in i.i.d. and heterogeneous settings and knowing the importance of batch sizes for efficiency and generalization, optimal local batch sizes are difficult to determine. We introduce adaptive batch size strategies for local gradient methods that increase batch sizes adaptively to reduce minibatch gradient variance. We provide convergence guarantees under homogeneous data conditions and support our claims with image classification experiments, demonstrating the effectiveness of our strategies in training and generalization.
- [863] arXiv:2406.13944 (cross-list from math.ST) [pdf, html, other]
-
Title: Generalization error of min-norm interpolators in transfer learningComments: 53 pages, 2 figuresSubjects: Statistics Theory (math.ST); Machine Learning (cs.LG); Methodology (stat.ME); Machine Learning (stat.ML)
This paper establishes the generalization error of pooled min-$\ell_2$-norm interpolation in transfer learning where data from diverse distributions are available. Min-norm interpolators emerge naturally as implicit regularized limits of modern machine learning algorithms. Previous work characterized their out-of-distribution risk when samples from the test distribution are unavailable during training. However, in many applications, a limited amount of test data may be available during training, yet properties of min-norm interpolation in this setting are not well-understood. We address this gap by characterizing the bias and variance of pooled min-$\ell_2$-norm interpolation under covariate and model shifts. The pooled interpolator captures both early fusion and a form of intermediate fusion. Our results have several implications: under model shift, for low signal-to-noise ratio (SNR), adding data always hurts. For higher SNR, transfer learning helps as long as the shift-to-signal (SSR) ratio lies below a threshold that we characterize explicitly. By consistently estimating these ratios, we provide a data-driven method to determine: (i) when the pooled interpolator outperforms the target-based interpolator, and (ii) the optimal number of target samples that minimizes the generalization error. Under covariate shift, if the source sample size is small relative to the dimension, heterogeneity between between domains improves the risk, and vice versa. We establish a novel anisotropic local law to achieve these characterizations, which may be of independent interest in random matrix theory. We supplement our theoretical characterizations with comprehensive simulations that demonstrate the finite-sample efficacy of our results.
- [864] arXiv:2406.13946 (cross-list from physics.app-ph) [pdf, other]
-
Title: Ferroelectric Materials for Synaptic Transistors and Their Neuromorphic ApplicationsComments: 44 pages, 7 figures,Subjects: Applied Physics (physics.app-ph); Materials Science (cond-mat.mtrl-sci); Emerging Technologies (cs.ET)
After more than a hundred years of development, ferroelectric materials have demonstrated their strong potential to people, and more and more ferroelectric materials are being used in the research of ferroelectric transistors (FeFETs). As a new generation of neuromorphic devices, ferroelectric materials have attracted people's attention due to their powerful functions and many characteristics. This article summarizes the development of ferroelectric material systems in recent years and discusses the simulation of artificial synapses. The mainstream ferroelectric materials are divided into traditional perovskite structure, fluorite structure, organic polymer, and new 2D van der Waals ferroelectricity. The principles, research progress, and optimization for brain like computers of each material system are introduced, and the latest application progress is summarized. Finally, the scope of application of different material systems is discussed, with the aim of helping people screen out different material systems based on different needs.
- [865] arXiv:2406.13977 (cross-list from eess.IV) [pdf, html, other]
-
Title: Similarity-aware Syncretic Latent Diffusion Model for Medical Image Translation with Representation LearningSubjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Non-contrast CT (NCCT) imaging may reduce image contrast and anatomical visibility, potentially increasing diagnostic uncertainty. In contrast, contrast-enhanced CT (CECT) facilitates the observation of regions of interest (ROI). Leading generative models, especially the conditional diffusion model, demonstrate remarkable capabilities in medical image modality transformation. Typical conditional diffusion models commonly generate images with guidance of segmentation labels for medical modal transformation. Limited access to authentic guidance and its low cardinality can pose challenges to the practical clinical application of conditional diffusion models. To achieve an equilibrium of generative quality and clinical practices, we propose a novel Syncretic generative model based on the latent diffusion model for medical image translation (S$^2$LDM), which can realize high-fidelity reconstruction without demand of additional condition during inference. S$^2$LDM enhances the similarity in distinct modal images via syncretic encoding and diffusing, promoting amalgamated information in the latent space and generating medical images with more details in contrast-enhanced regions. However, syncretic latent spaces in the frequency domain tend to favor lower frequencies, commonly locate in identical anatomic structures. Thus, S$^2$LDM applies adaptive similarity loss and dynamic similarity to guide the generation and supplements the shortfall in high-frequency details throughout the training process. Quantitative experiments confirm the effectiveness of our approach in medical image translation. Our code will release lately.
- [866] arXiv:2406.13979 (cross-list from eess.IV) [pdf, html, other]
-
Title: Knowledge-driven Subspace Fusion and Gradient Coordination for Multi-modal LearningSubjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Multi-modal learning plays a crucial role in cancer diagnosis and prognosis. Current deep learning based multi-modal approaches are often limited by their abilities to model the complex correlations between genomics and histology data, addressing the intrinsic complexity of tumour ecosystem where both tumour and microenvironment contribute to malignancy. We propose a biologically interpretative and robust multi-modal learning framework to efficiently integrate histology images and genomics by decomposing the feature subspace of histology images and genomics, reflecting distinct tumour and microenvironment features. To enhance cross-modal interactions, we design a knowledge-driven subspace fusion scheme, consisting of a cross-modal deformable attention module and a gene-guided consistency strategy. Additionally, in pursuit of dynamically optimizing the subspace knowledge, we further propose a novel gradient coordination learning strategy. Extensive experiments demonstrate the effectiveness of the proposed method, outperforming state-of-the-art techniques in three downstream tasks of glioma diagnosis, tumour grading, and survival analysis. Our code is available at this https URL.
- [867] arXiv:2406.13989 (cross-list from stat.ML) [pdf, html, other]
-
Title: Random pairing MLE for estimation of item parameters in Rasch modelSubjects: Machine Learning (stat.ML); Information Theory (cs.IT); Machine Learning (cs.LG); Statistics Theory (math.ST)
The Rasch model, a classical model in the item response theory, is widely used in psychometrics to model the relationship between individuals' latent traits and their binary responses on assessments or questionnaires. In this paper, we introduce a new likelihood-based estimator -- random pairing maximum likelihood estimator ($\mathsf{RP\text{-}MLE}$) and its bootstrapped variant multiple random pairing MLE ($\mathsf{MRP\text{-}MLE}$) that faithfully estimate the item parameters in the Rasch model. The new estimators have several appealing features compared to existing ones. First, both work for sparse observations, an increasingly important scenario in the big data era. Second, both estimators are provably minimax optimal in terms of finite sample $\ell_{\infty}$ estimation error. Lastly, $\mathsf{RP\text{-}MLE}$ admits precise distributional characterization that allows uncertainty quantification on the item parameters, e.g., construction of confidence intervals of the item parameters. The main idea underlying $\mathsf{RP\text{-}MLE}$ and $\mathsf{MRP\text{-}MLE}$ is to randomly pair user-item responses to form item-item comparisons. This is carefully designed to reduce the problem size while retaining statistical independence. We also provide empirical evidence of the efficacy of the two new estimators using both simulated and real data.
- [868] arXiv:2406.13995 (cross-list from stat.ML) [pdf, html, other]
-
Title: Prediction of Unobserved Bifurcation by Unsupervised Extraction of Slowly Time-Varying System Parameter Dynamics from Time Series Using Reservoir ComputingComments: 17 pages, 7 figuresSubjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Chaotic Dynamics (nlin.CD)
Nonlinear and non-stationary processes are prevalent in various natural and physical phenomena, where system dynamics can change qualitatively due to bifurcation phenomena. Traditional machine learning methods have advanced our ability to learn and predict such systems from observed time series data. However, predicting the behavior of systems with temporal parameter variations without knowledge of true parameter values remains a significant challenge. This study leverages the reservoir computing framework to address this problem by unsupervised extraction of slowly varying system parameters from time series data. We propose a model architecture consisting of a slow reservoir with long timescale internal dynamics and a fast reservoir with short timescale dynamics. The slow reservoir extracts the temporal variation of system parameters, which are then used to predict unknown bifurcations in the fast dynamics. Through experiments using data generated from chaotic dynamical systems, we demonstrate the ability to predict bifurcations not present in the training data. Our approach shows potential for applications in fields such as neuroscience, material science, and weather prediction, where slow dynamics influencing qualitative changes are often unobservable.
- [869] arXiv:2406.14000 (cross-list from math.OC) [pdf, html, other]
-
Title: Robust nonlinear state-feedback control of second-order systemsComments: 6 pages, 6 figuresSubjects: Optimization and Control (math.OC); Systems and Control (eess.SY)
This note proposes a novel nonlinear state feedback controller for perturbed second-order systems. In analogy to a linear proportional-derivative (PD) output feedback control, the proposed nonlinear scheme uses the output state of interest and its time derivative for a robust finite-time regulation. The control has only one free design parameter, and the closed-loop system is shown to be uniformly asymptotically stable in the presence of matched disturbances. We derive a strict Lyapunov function for the closed control loop with a bounded exogenous perturbation, and use it for both the control tuning and analysis of the finite-time convergence. Apart from the numerical results, a revealing experimental example is also shown in favor of the proposed control and in comparison with PD and sub-optimal nonlinear damping regulators.
- [870] arXiv:2406.14003 (cross-list from stat.ML) [pdf, html, other]
-
Title: Deep Optimal Experimental Design for Parameter Estimation ProblemsSubjects: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Methodology (stat.ME)
Optimal experimental design is a well studied field in applied science and engineering. Techniques for estimating such a design are commonly used within the framework of parameter estimation. Nonetheless, in recent years parameter estimation techniques are changing rapidly with the introduction of deep learning techniques to replace traditional estimation methods. This in turn requires the adaptation of optimal experimental design that is associated with these new techniques. In this paper we investigate a new experimental design methodology that uses deep learning. We show that the training of a network as a Likelihood Free Estimator can be used to significantly simplify the design process and circumvent the need for the computationally expensive bi-level optimization problem that is inherent in optimal experimental design for non-linear systems. Furthermore, deep design improves the quality of the recovery process for parameter estimation problems. As proof of concept we apply our methodology to two different systems of Ordinary Differential Equations.
- [871] arXiv:2406.14009 (cross-list from stat.ML) [pdf, html, other]
-
Title: Confidence Intervals and Simultaneous Confidence Bands Based on Deep LearningSubjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Deep learning models have significantly improved prediction accuracy in various fields, gaining recognition across numerous disciplines. Yet, an aspect of deep learning that remains insufficiently addressed is the assessment of prediction uncertainty. Producing reliable uncertainty estimators could be crucial in practical terms. For instance, predictions associated with a high degree of uncertainty could be sent for further evaluation. Recent works in uncertainty quantification of deep learning predictions, including Bayesian posterior credible intervals and a frequentist confidence-interval estimation, have proven to yield either invalid or overly conservative intervals. Furthermore, there is currently no method for quantifying uncertainty that can accommodate deep neural networks for survival (time-to-event) data that involves right-censored outcomes. In this work, we provide a valid non-parametric bootstrap method that correctly disentangles data uncertainty from the noise inherent in the adopted optimization algorithm, ensuring that the resulting point-wise confidence intervals or the simultaneous confidence bands are accurate (i.e., valid and not overly conservative). The proposed ad-hoc method can be easily integrated into any deep neural network without interfering with the training process. The utility of the proposed approach is illustrated by constructing simultaneous confidence bands for survival curves derived from deep neural networks for survival data with right censoring.
- [872] arXiv:2406.14033 (cross-list from stat.ML) [pdf, other]
-
Title: Ensembles of Probabilistic Regression TreesAlexandre Seiller, Éric Gaussier (APTIKAL), Emilie Devijver (APTIKAL), Marianne Clausel (IECL), Sami AlkhourySubjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Tree-based ensemble methods such as random forests, gradient-boosted trees, and Bayesianadditive regression trees have been successfully used for regression problems in many applicationsand research studies. In this paper, we study ensemble versions of probabilisticregression trees that provide smooth approximations of the objective function by assigningeach observation to each region with respect to a probability distribution. We prove thatthe ensemble versions of probabilistic regression trees considered are consistent, and experimentallystudy their bias-variance trade-off and compare them with the state-of-the-art interms of performance prediction.
- [873] arXiv:2406.14040 (cross-list from stat.ML) [pdf, html, other]
-
Title: A Practical Diffusion Path for SamplingSubjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Diffusion models are state-of-the-art methods in generative modeling when samples from a target probability distribution are available, and can be efficiently sampled, using score matching to estimate score vectors guiding a Langevin process. However, in the setting where samples from the target are not available, e.g. when this target's density is known up to a normalization constant, the score estimation task is challenging. Previous approaches rely on Monte Carlo estimators that are either computationally heavy to implement or sample-inefficient. In this work, we propose a computationally attractive alternative, relying on the so-called dilation path, that yields score vectors that are available in closed-form. This path interpolates between a Dirac and the target distribution using a convolution. We propose a simple implementation of Langevin dynamics guided by the dilation path, using adaptive step-sizes. We illustrate the results of our sampling method on a range of tasks, and shows it performs better than classical alternatives.
- [874] arXiv:2406.14044 (cross-list from physics.atm-clus) [pdf, html, other]
-
Title: Encoder-Decoder Neural Networks in Interpretation of X-ray SpectraSubjects: Atomic and Molecular Clusters (physics.atm-clus); Machine Learning (cs.LG); Data Analysis, Statistics and Probability (physics.data-an)
Encoder-decoder neural networks (EDNN) condense information most relevant to the output of the feedforward network to activation values at a bottleneck layer. We study the use of this architecture in emulation and interpretation of simulated X-ray spectroscopic data with the aim to identify key structural characteristics for the spectra, previously studied using emulator-based component analysis (ECA). We find an EDNN to outperform ECA in covered target variable variance, but also discover complications in interpreting the latent variables in physical terms. As a compromise of the benefits of these two approaches, we develop a network where the linear projection of ECA is used, thus maintaining the beneficial characteristics of vector expansion from the latent variables for their interpretation. These results underline the necessity of information recovery after its condensation and identification of decisive structural degrees for the output spectra for a justified interpretation.
- [875] arXiv:2406.14052 (cross-list from eess.IV) [pdf, html, other]
-
Title: Perspective+ Unet: Enhancing Segmentation with Bi-Path Fusion and Efficient Non-Local Attention for Superior Receptive FieldsComments: 13 pages, 5 figuresSubjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Precise segmentation of medical images is fundamental for extracting critical clinical information, which plays a pivotal role in enhancing the accuracy of diagnoses, formulating effective treatment plans, and improving patient outcomes. Although Convolutional Neural Networks (CNNs) and non-local attention methods have achieved notable success in medical image segmentation, they either struggle to capture long-range spatial dependencies due to their reliance on local features, or face significant computational and feature integration challenges when attempting to address this issue with global attention mechanisms. To overcome existing limitations in medical image segmentation, we propose a novel architecture, Perspective+ Unet. This framework is characterized by three major innovations: (i) It introduces a dual-pathway strategy at the encoder stage that combines the outcomes of traditional and dilated convolutions. This not only maintains the local receptive field but also significantly expands it, enabling better comprehension of the global structure of images while retaining detail sensitivity. (ii) The framework incorporates an efficient non-local transformer block, named ENLTB, which utilizes kernel function approximation for effective long-range dependency capture with linear computational and spatial complexity. (iii) A Spatial Cross-Scale Integrator strategy is employed to merge global dependencies and local contextual cues across model stages, meticulously refining features from various levels to harmonize global and local information. Experimental results on the ACDC and Synapse datasets demonstrate the effectiveness of our proposed Perspective+ Unet. The code is available in the supplementary material.
- [876] arXiv:2406.14069 (cross-list from eess.IV) [pdf, html, other]
-
Title: Towards Multi-modality Fusion and Prototype-based Feature Refinement for Clinically Significant Prostate Cancer Classification in Transrectal UltrasoundSubjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Prostate cancer is a highly prevalent cancer and ranks as the second leading cause of cancer-related deaths in men globally. Recently, the utilization of multi-modality transrectal ultrasound (TRUS) has gained significant traction as a valuable technique for guiding prostate biopsies. In this study, we propose a novel learning framework for clinically significant prostate cancer (csPCa) classification using multi-modality TRUS. The proposed framework employs two separate 3D ResNet-50 to extract distinctive features from B-mode and shear wave elastography (SWE). Additionally, an attention module is incorporated to effectively refine B-mode features and aggregate the extracted features from both modalities. Furthermore, we utilize few shot segmentation task to enhance the capacity of classification encoder. Due to the limited availability of csPCa masks, a prototype correction module is employed to extract representative prototypes of csPCa. The performance of the framework is assessed on a large-scale dataset consisting of 512 TRUS videos with biopsy-proved prostate cancer. The results demonstrate the strong capability in accurately identifying csPCa, achieving an area under the curve (AUC) of 0.86. Moreover, the framework generates visual class activation mapping (CAM), which can serve as valuable assistance for localizing csPCa. These CAM images may offer valuable guidance during TRUS-guided targeted biopsies, enhancing the efficacy of the biopsy procedure.The code is available at this https URL.
- [877] arXiv:2406.14071 (cross-list from stat.ML) [pdf, html, other]
-
Title: Bayesian Bandit Algorithms with Approximate Inference in Stochastic Linear BanditsSubjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Bayesian bandit algorithms with approximate Bayesian inference have been widely used in real-world applications. Nevertheless, their theoretical justification is less investigated in the literature, especially for contextual bandit problems. To fill this gap, we propose a general theoretical framework to analyze stochastic linear bandits in the presence of approximate inference and conduct regret analysis on two Bayesian bandit algorithms, Linear Thompson sampling (LinTS) and the extension of Bayesian Upper Confidence Bound, namely Linear Bayesian Upper Confidence Bound (LinBUCB). We demonstrate that both LinTS and LinBUCB can preserve their original rates of regret upper bound but with a sacrifice of larger constant terms when applied with approximate inference. These results hold for general Bayesian inference approaches, under the assumption that the inference error measured by two different $\alpha$-divergences is bounded. Additionally, by introducing a new definition of well-behaved distributions, we show that LinBUCB improves the regret rate of LinTS from $\tilde{O}(d^{3/2}\sqrt{T})$ to $\tilde{O}(d\sqrt{T})$, matching the minimax optimal rate. To our knowledge, this work provides the first regret bounds in the setting of stochastic linear bandits with bounded approximate inference errors.
- [878] arXiv:2406.14084 (cross-list from quant-ph) [pdf, other]
-
Title: Queen: A quick, scalable, and comprehensive quantum circuit simulation for supercomputingSubjects: Quantum Physics (quant-ph); Performance (cs.PF)
The state vector-based simulation offers a convenient approach to developing and validating quantum algorithms with noise-free results. However, limited by the absence of cache-aware implementations and unpolished circuit optimizations, the past simulators were severely constrained in performance, leading to stagnation in quantum computing. In this paper, we present an innovative quantum circuit simulation toolkit comprising gate optimization and simulation modules to address these performance challenges. For the performance, scalability, and comprehensive evaluation, we conduct a series of particular circuit benchmarks and strong scaling tests on a DGX-A100 workstation and achieve averaging 9 times speedup compared to state-of-the-art simulators, including QuEST, IBM-Aer, and NVIDIA-cuQuantum. Moreover, the critical performance metric FLOPS increases by up to a factor of 8-fold, and arithmetic intensity experiences a remarkable 96x enhancement. We believe the proposed toolkit paves the way for faster quantum circuit simulations, thereby facilitating the development of novel quantum algorithms.
- [879] arXiv:2406.14094 (cross-list from math.LO) [pdf, html, other]
-
Title: Logical reduction of relations: from relational databases to Peirce's reduction thesisComments: 31 pages, 4 figuresJournal-ref: Logic Journal of the IGPL, 2023Subjects: Logic (math.LO); Databases (cs.DB); Logic in Computer Science (cs.LO); Rings and Algebras (math.RA)
We study logical reduction (factorization) of relations into relations of lower arity by Boolean or relative products that come from applying conjunctions and existential quantifiers to predicates, i.e. by primitive positive formulas of predicate calculus. Our algebraic framework unifies natural joins and data dependencies of database theory and relational algebra of clone theory with the bond algebra of C.S. Peirce. We also offer new constructions of reductions, systematically study irreducible relations and reductions to them, and introduce a new characteristic of relations, ternarity, that measures their `complexity of relating' and allows to refine reduction results. In particular, we refine Peirce's controversial reduction thesis, and show that reducibility behavior is dramatically different on finite and infinite domains.
- [880] arXiv:2406.14118 (cross-list from eess.IV) [pdf, html, other]
-
Title: Prediction and Reference Quality Adaptation for Learned Video CompressionSubjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Temporal prediction is one of the most important technologies for video compression. Various prediction coding modes are designed in traditional video codecs. Traditional video codecs will adaptively to decide the optimal coding mode according to the prediction quality and reference quality. Recently, learned video codecs have made great progress. However, they ignore the prediction and reference quality adaptation, which leads to incorrect utilization of temporal prediction and reconstruction error propagation. Therefore, in this paper, we first propose a confidence-based prediction quality adaptation (PQA) module to provide explicit discrimination for the spatial and channel-wise prediction quality difference. With this module, the prediction with low quality will be suppressed and that with high quality will be enhanced. The codec can adaptively decide which spatial or channel location of predictions to use. Then, we further propose a reference quality adaptation (RQA) module and an associated repeat-long training strategy to provide dynamic spatially variant filters for diverse reference qualities. With the filters, it is easier for our codec to achieve the target reconstruction quality according to reference qualities, thus reducing the propagation of reconstruction errors. Experimental results show that our codec obtains higher compression performance than the reference software of H.266/VVC and the previous state-of-the-art learned video codecs in both RGB and YUV420 colorspaces.
- [881] arXiv:2406.14126 (cross-list from eess.SP) [pdf, other]
-
Title: Joint Optimization of Switching Point and Power Control in Dynamic TDD Cell-Free Massive MIMOComments: Presented at the Asilomar Conference on Signals, Systems, and Computers 2023Subjects: Signal Processing (eess.SP); Information Theory (cs.IT)
We consider a cell-free massive multiple-input multiple-output (CFmMIMO) network operating in dynamic time division duplex (DTDD). The switching point between the uplink (UL) and downlink (DL) data transmission phases can be adapted dynamically to the instantaneous quality-of-service (QoS) requirements in order to improve energy efficiency (EE). To this end, we formulate a problem of optimizing the DTDD switching point jointly with the UL and DL power control coefficients, and the large-scale fading decoding (LSFD) weights for EE maximization. Then, we propose an iterative algorithm to solve the formulated challenging problem using successive convex approximation with an approximate stationary solution. Simulation results show that optimizing switching points remarkably improves EE compared with baseline schemes that adjust switching points heuristically.
- [882] arXiv:2406.14142 (cross-list from q-bio.QM) [pdf, other]
-
Title: Geometric Self-Supervised Pretraining on 3D Protein Structures using SubgraphsSubjects: Quantitative Methods (q-bio.QM); Machine Learning (cs.LG); Biomolecules (q-bio.BM)
Protein representation learning aims to learn informative protein embeddings capable of addressing crucial biological questions, such as protein function prediction. Although sequence-based transformer models have shown promising results by leveraging the vast amount of protein sequence data in a self-supervised way, there is still a gap in applying these methods to 3D protein structures. In this work, we propose a pre-training scheme going beyond trivial masking methods leveraging 3D and hierarchical structures of proteins. We propose a novel self-supervised method to pretrain 3D graph neural networks on 3D protein structures, by predicting the distances between local geometric centroids of protein subgraphs and the global geometric centroid of the protein. The motivation for this method is twofold. First, the relative spatial arrangements and geometric relationships among different regions of a protein are crucial for its function. Moreover, proteins are often organized in a hierarchical manner, where smaller substructures, such as secondary structure elements, assemble into larger domains. By considering subgraphs and their relationships to the global protein structure, the model can learn to reason about these hierarchical levels of organization. We experimentally show that our proposed pertaining strategy leads to significant improvements in the performance of 3D GNNs in various protein classification tasks.
- [883] arXiv:2406.14149 (cross-list from physics.chem-ph) [pdf, html, other]
-
Title: CheMFi: A Multifidelity Dataset of Quantum Chemical Properties of Diverse MoleculesComments: SI not includedSubjects: Chemical Physics (physics.chem-ph); Machine Learning (cs.LG)
Progress in both Machine Learning (ML) and conventional Quantum Chemistry (QC) computational methods have resulted in high accuracy ML models for QC properties ranging from atomization energies to excitation energies. Various datasets such as MD17, MD22, and WS22, which consist of properties calculated at some level of QC method, or fidelity, have been generated to benchmark such ML models. The term fidelity refers to the accuracy of the chosen QC method to the actual real value of the property. The higher the fidelity, the more accurate the calculated property, albeit at a higher computational cost.
Research in multifidelity ML (MFML) methods, where ML models are trained on data from more than one numerical QC method, has shown the effectiveness of such models over single fidelity methods. Much research is progressing in this direction for diverse applications ranging from energy band gaps to excitation energies. A major hurdle for effective research in this field of research in the community is the lack of a diverse multifidelity dataset for benchmarking.
Here, we present a comprehensive multifidelity dataset drawn from the WS22 molecular conformations. We provide the quantum Chemistry MultiFidelity (CheMFi) dataset consisting of five fidelities calculated with the TD-DFT formalism. The fidelities differ in their basis set choice and are namely: STO-3G, 3-21G, 6-31G, def2-SVP, and def2-TZVP. CheMFi offers to the community a variety of QC properties including vertical excitation energies, oscillator strengths, molecular dipole moments, and ground state energies. In addition to the dataset, multifidelity benchmarks are set with state-of-the-art MFML and optimized-MFML - [884] arXiv:2406.14186 (cross-list from eess.IV) [pdf, html, other]
-
Title: CriDiff: Criss-cross Injection Diffusion Framework via Generative Pre-train for Prostate SegmentationComments: Accepted in MICCAI 2024Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Recently, the Diffusion Probabilistic Model (DPM)-based methods have achieved substantial success in the field of medical image segmentation. However, most of these methods fail to enable the diffusion model to learn edge features and non-edge features effectively and to inject them efficiently into the diffusion backbone. Additionally, the domain gap between the images features and the diffusion model features poses a great challenge to prostate segmentation. In this paper, we proposed CriDiff, a two-stage feature injecting framework with a Crisscross Injection Strategy (CIS) and a Generative Pre-train (GP) approach for prostate segmentation. The CIS maximizes the use of multi-level features by efficiently harnessing the complementarity of high and low-level features. To effectively learn multi-level of edge features and non-edge features, we proposed two parallel conditioners in the CIS: the Boundary Enhance Conditioner (BEC) and the Core Enhance Conditioner (CEC), which discriminatively model the image edge regions and non-edge regions, respectively. Moreover, the GP approach eases the inconsistency between the images features and the diffusion model without adding additional parameters. Extensive experiments on four benchmark datasets demonstrate the effectiveness of the proposed method and achieve state-of-the-art performance on four evaluation metrics.
- [885] arXiv:2406.14198 (cross-list from econ.TH) [pdf, html, other]
-
Title: Guaranteed shares of benefits and costsSubjects: Theoretical Economics (econ.TH); Computer Science and Game Theory (cs.GT)
In a general fair division model with transferable utilities we discuss endogenous lower and upper guarantees on individual shares of benefits or costs. Like the more familiar exogenous bounds on individual shares described by an outside option or a stand alone utility, these guarantees depend on my type but not on others' types, only on their number and the range of types. Keeping the range from worst share to best share as narrow as permitted by the physical constraints of the model still leaves a large menu of tight guarantee functions. We describe in detail these design options in several iconic problems where each tight pair of guarantees has a clear normative meaning: the allocation of indivisible goods or costly chores, cost sharing of a public facility and the exploitation of a commons with substitute or complementary inputs. The corresponding benefit or cost functions are all sub- or super-modular, and for this class we characterise the set of minimal upper and maximal lower guarantees in all two agent problems.
- [886] arXiv:2406.14210 (cross-list from eess.IV) [pdf, html, other]
-
Title: Self-Supervised Pretext Tasks for Alzheimer's Disease Classification using 3D Convolutional Neural Networks on Large-Scale Synthetic Neuroimaging DatasetSubjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Structural magnetic resonance imaging (MRI) studies have shown that Alzheimer's Disease (AD) induces both localised and widespread neural degenerative changes throughout the brain. However, the absence of segmentation that highlights brain degenerative changes presents unique challenges for training CNN-based classifiers in a supervised fashion. In this work, we evaluated several unsupervised methods to train a feature extractor for downstream AD vs. CN classification. Using the 3D T1-weighted MRI data of cognitive normal (CN) subjects from the synthetic neuroimaging LDM100K dataset, lightweight 3D CNN-based models are trained for brain age prediction, brain image rotation classification, brain image reconstruction and a multi-head task combining all three tasks into one. Feature extractors trained on the LDM100K synthetic dataset achieved similar performance compared to the same model using real-world data. This supports the feasibility of utilising large-scale synthetic data for pretext task training. All the training and testing splits are performed on the subject-level to prevent data leakage issues. Alongside the simple preprocessing steps, the random cropping data augmentation technique shows consistent improvement across all experiments.
- [887] arXiv:2406.14211 (cross-list from math.OC) [pdf, html, other]
-
Title: Optimization over bounded-rank matrices through a desingularization enables joint global and local guaranteesSubjects: Optimization and Control (math.OC); Numerical Analysis (math.NA)
Convergence guarantees for optimization over bounded-rank matrices are delicate to obtain because the feasible set is a non-smooth and non-convex algebraic variety. Existing techniques include projected gradient descent, fixed-rank optimization (over the maximal-rank stratum), and the LR parameterization. They all lack either global guarantees (the ability to accumulate only at critical points) or fast local convergence (e.g., if the limit has non-maximal rank). We seek optimization algorithms that enjoy both.
Khrulkov and Oseledets [2018] parameterize the bounded-rank variety via a desingularization to recast the optimization problem onto a smooth manifold. Building on their ideas, we develop a Riemannian geometry for this desingularization, also with care for numerical considerations. We use it to secure global convergence to critical points with fast local rates, for a large range of algorithms. On matrix completion tasks, we find that this approach is comparable to others, while enjoying better general-purpose theoretical guarantees. - [888] arXiv:2406.14234 (cross-list from physics.med-ph) [pdf, html, other]
-
Title: Zero field active shieldingComments: 26 pages, 7 figuresSubjects: Medical Physics (physics.med-ph); Human-Computer Interaction (cs.HC); Signal Processing (eess.SP); Instrumentation and Detectors (physics.ins-det); Neurons and Cognition (q-bio.NC)
Ambient field suppression is critical for accurate magnetic field measurements, and a requirement for certain low-field sensors to operate. The difference in magnitude between noise and signal (up to 10$^9$) makes the problem challenging, and solutions such as passive shielding, post-hoc processing, and most active shielding designs do not address it completely. Zero field active shielding (ZFS) achieves accurate field suppression with a feed-forward structure in which correction coils are fed by reference sensors via a matrix found using data-driven methods. Requirements are a sufficient number of correction coils and reference sensors to span the ambient field at the sensors, and to zero out the coil-to-reference sensor coupling. The solution assumes instantaneous propagation and mixing, but it can be extended to handle convolutional effects. Precise calculations based on sensor and coil geometries are not necessary, other than to improve efficiency and usability. The solution is simulated here but not implemented in hardware.
- [889] arXiv:2406.14236 (cross-list from quant-ph) [pdf, other]
-
Title: NAC-QFL: Noise Aware Clustered Quantum Federated LearningSubjects: Quantum Physics (quant-ph); Distributed, Parallel, and Cluster Computing (cs.DC)
Recent advancements in quantum computing, alongside successful deployments of quantum communication, hold promises for revolutionizing mobile networks. While Quantum Machine Learning (QML) presents opportunities, it contends with challenges like noise in quantum devices and scalability. Furthermore, the high cost of quantum communication constrains the practical application of QML in real-world scenarios. This paper introduces a noise-aware clustered quantum federated learning system that addresses noise mitigation, limited quantum device capacity, and high quantum communication costs in distributed QML. It employs noise modelling and clustering to select devices with minimal noise and distribute QML tasks efficiently. Using circuit partitioning to deploy smaller models on low-noise devices and aggregating similar devices, the system enhances distributed QML performance and reduces communication costs. Leveraging circuit cutting, QML techniques are more effective for smaller circuit sizes and fidelity. We conduct experimental evaluations to assess the performance of the proposed system. Additionally, we introduce a noisy dataset for QML to demonstrate the impact of noise on proposed accuracy.
- [890] arXiv:2406.14246 (cross-list from q-bio.QM) [pdf, html, other]
-
Title: Non-Negative Universal Differential Equations With Applications in Systems BiologyComments: 6 pages, This work has been submitted to IFAC for possible publication. Initial submission was March 18, 2024Subjects: Quantitative Methods (q-bio.QM); Machine Learning (cs.LG); Dynamical Systems (math.DS); Machine Learning (stat.ML)
Universal differential equations (UDEs) leverage the respective advantages of mechanistic models and artificial neural networks and combine them into one dynamic model. However, these hybrid models can suffer from unrealistic solutions, such as negative values for biochemical quantities. We present non-negative UDE (nUDEs), a constrained UDE variant that guarantees non-negative values. Furthermore, we explore regularisation techniques to improve generalisation and interpretability of UDEs.
- [891] arXiv:2406.14264 (cross-list from eess.IV) [pdf, html, other]
-
Title: Zero-Shot Image Denoising for High-Resolution Electron MicroscopyComments: 12 pages, 12 figuresSubjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
High-resolution electron microscopy (HREM) imaging technique is a powerful tool for directly visualizing a broad range of materials in real-space. However, it faces challenges in denoising due to ultra-low signal-to-noise ratio (SNR) and scarce data availability. In this work, we propose Noise2SR, a zero-shot self-supervised learning (ZS-SSL) denoising framework for HREM. Within our framework, we propose a super-resolution (SR) based self-supervised training strategy, incorporating the Random Sub-sampler module. The Random Sub-sampler is designed to generate approximate infinite noisy pairs from a single noisy image, serving as an effective data augmentation in zero-shot denoising. Noise2SR trains the network with paired noisy images of different resolutions, which is conducted via SR strategy. The SR-based training facilitates the network adopting more pixels for supervision, and the random sub-sampling helps compel the network to learn continuous signals enhancing the robustness. Meanwhile, we mitigate the uncertainty caused by random-sampling by adopting minimum mean squared error (MMSE) estimation for the denoised results. With the distinctive integration of training strategy and proposed designs, Noise2SR can achieve superior denoising performance using a single noisy HREM image. We evaluate the performance of Noise2SR in both simulated and real HREM denoising tasks. It outperforms state-of-the-art ZS-SSL methods and achieves comparable denoising performance with supervised methods. The success of Noise2SR suggests its potential for improving the SNR of images in material imaging domains.
- [892] arXiv:2406.14285 (cross-list from quant-ph) [pdf, html, other]
-
Title: A Survey of Methods for Mitigating Barren Plateaus for Parameterized Quantum CircuitsComments: 18 pagesSubjects: Quantum Physics (quant-ph); Emerging Technologies (cs.ET)
Barren Plateaus are a formidable challenge for hybrid quantum-classical algorithms that lead to flat plateaus in the loss function landscape making it difficult to take advantage of the expressive power of parameterized quantum circuits with gradient-based methods. Like in classical neural network models, parameterized quantum circuits suffer the same vanishing gradient issue due to large parameter spaces with non-convex landscapes. In this review, we present an overview of the different genesis for barren plateaus, mathematical formalisms of common themes around barren plateaus, and dives into gradients. The central objective is to provide a conceptual perspective between classical and quantum interpretations of vanishing gradients as well as dive into techniques involving cost functions, entanglement, and initialization strategies to mitigate barren plateaus. Addressing barren plateaus paves the way towards feasibility of many classically intractable applications for quantum simulation, optimization, chemistry, and quantum machine learning.
- [893] arXiv:2406.14287 (cross-list from eess.IV) [pdf, html, other]
-
Title: Segmentation of Non-Small Cell Lung Carcinomas: Introducing DRU-Net and Multi-Lens DistortionSoroush Oskouei, Marit Valla, André Pedersen, Erik Smistad, Vibeke Grotnes Dale, Maren Høibø, Sissel Gyrid Freim Wahl, Mats Dehli Haugum, Thomas Langø, Maria Paula Ramnefjell, Lars Andreas Akslen, Gabriel Kiss, Hanne SorgerComments: 16 pages, 7 figures, submitted to Scientific ReportsSubjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
Considering the increased workload in pathology laboratories today, automated tools such as artificial intelligence models can help pathologists with their tasks and ease the workload. In this paper, we are proposing a segmentation model (DRU-Net) that can provide a delineation of human non-small cell lung carcinomas and an augmentation method that can improve classification results. The proposed model is a fused combination of truncated pre-trained DenseNet201 and ResNet101V2 as a patch-wise classifier followed by a lightweight U-Net as a refinement model. We have used two datasets (Norwegian Lung Cancer Biobank and Haukeland University Hospital lung cancer cohort) to create our proposed model. The DRU-Net model achieves an average of 0.91 Dice similarity coefficient. The proposed spatial augmentation method (multi-lens distortion) improved the network performance by 3%. Our findings show that choosing image patches that specifically include regions of interest leads to better results for the patch-wise classifier compared to other sampling methods. The qualitative analysis showed that the DRU-Net model is generally successful in detecting the tumor. On the test set, some of the cases showed areas of false positive and false negative segmentation in the periphery, particularly in tumors with inflammatory and reactive changes.
- [894] arXiv:2406.14299 (cross-list from math.OC) [pdf, other]
-
Title: Symplectic Stiefel manifold: tractable metrics, second-order geometry and Newton's methodsComments: 41 pages, 5 figures, 7 tablesSubjects: Optimization and Control (math.OC); Numerical Analysis (math.NA); Symplectic Geometry (math.SG); Quantum Physics (quant-ph)
Optimization under the symplecticity constraint is an approach for solving various problems in quantum physics and scientific computing. Building on the results that this optimization problem can be transformed into an unconstrained problem on the symplectic Stiefel manifold, we construct geometric ingredients for Riemannian optimization with a new family of Riemannian metrics called tractable metrics and develop Riemannian Newton schemes. The newly obtained ingredients do not only generalize several existing results but also provide us with freedom to choose a suitable metric for each problem. To the best of our knowledge, this is the first try to develop the explicit second-order geometry and Newton's methods on the symplectic Stiefel manifold. For the Riemannian Newton method, we first consider novel operator-valued formulas for computing the Riemannian Hessian of a~cost function, which further allows the manifold to be endowed with a weighted Euclidean metric that can provide a preconditioning effect. We then solve the resulting Newton equation, as the central step of Newton's methods, directly via transforming it into a~saddle point problem followed by vectorization, or iteratively via applying any matrix-free iterative method either to the operator Newton equation or its saddle point formulation. Finally, we propose a hybrid Riemannian Newton optimization algorithm that enjoys both global convergence and quadratic/superlinear local convergence at the final stage. Various numerical experiments are presented to validate the proposed methods.
- [895] arXiv:2406.14302 (cross-list from stat.ML) [pdf, other]
-
Title: Identifiable Exchangeable Mechanisms for Causal Structure and Representation LearningSubjects: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Identifying latent representations or causal structures is important for good generalization and downstream task performance. However, both fields have been developed rather independently. We observe that several methods in both representation and causal structure learning rely on the same data-generating process (DGP), namely, exchangeable but not i.i.d. (independent and identically distributed) data. We provide a unified framework, termed Identifiable Exchangeable Mechanisms (IEM), for representation and structure learning under the lens of exchangeability. IEM provides new insights that let us relax the necessary conditions for causal structure identification in exchangeable non--i.i.d. data. We also demonstrate the existence of a duality condition in identifiable representation learning, leading to new identifiability results. We hope this work will pave the way for further research in causal representation learning.
- [896] arXiv:2406.14307 (cross-list from q-bio.GN) [pdf, html, other]
-
Title: QuST-LLM: Integrating Large Language Models for Comprehensive Spatial Transcriptomics AnalysisComments: 12 pages, 7 figuresSubjects: Genomics (q-bio.GN); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
In this paper, we introduce QuST-LLM, an innovative extension of QuPath that utilizes the capabilities of large language models (LLMs) to analyze and interpret spatial transcriptomics (ST) data. This tool effectively simplifies the intricate and high-dimensional nature of ST data by offering a comprehensive workflow that includes data loading, region selection, gene expression analysis, and functional annotation. QuST-LLM employs LLMs to transform complex ST data into understandable and detailed biological narratives based on gene ontology annotations, thereby significantly improving the interpretability of ST data. Consequently, users can interact with their own ST data using natural language. Hence, QuST-LLM provides researchers with a potent functionality to unravel the spatial and functional complexities of tissues, fostering novel insights and advancements in biomedical research.
- [897] arXiv:2406.14308 (cross-list from eess.IV) [pdf, html, other]
-
Title: FIESTA: Fourier-Based Semantic Augmentation with Uncertainty Guidance for Enhanced Domain Generalizability in Medical Image SegmentationComments: 40 pages, 7 figures, 5 tablesSubjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Single-source domain generalization (SDG) in medical image segmentation (MIS) aims to generalize a model using data from only one source domain to segment data from an unseen target domain. Despite substantial advances in SDG with data augmentation, existing methods often fail to fully consider the details and uncertain areas prevalent in MIS, leading to mis-segmentation. This paper proposes a Fourier-based semantic augmentation method called FIESTA using uncertainty guidance to enhance the fundamental goals of MIS in an SDG context by manipulating the amplitude and phase components in the frequency domain. The proposed Fourier augmentative transformer addresses semantic amplitude modulation based on meaningful angular points to induce pertinent variations and harnesses the phase spectrum to ensure structural coherence. Moreover, FIESTA employs epistemic uncertainty to fine-tune the augmentation process, improving the ability of the model to adapt to diverse augmented data and concentrate on areas with higher ambiguity. Extensive experiments across three cross-domain scenarios demonstrate that FIESTA surpasses recent state-of-the-art SDG approaches in segmentation performance and significantly contributes to boosting the applicability of the model in medical imaging modalities.
- [898] arXiv:2406.14330 (cross-list from quant-ph) [pdf, html, other]
-
Title: Promise of Graph Sparsification and Decomposition for Noise Reduction in QAOA: Analysis for Trapped-Ion CompilationsSubjects: Quantum Physics (quant-ph); Data Structures and Algorithms (cs.DS); Optimization and Control (math.OC)
We develop new approximate compilation schemes that significantly reduce the expense of compiling the Quantum Approximate Optimization Algorithm (QAOA) for solving the Max-Cut problem. Our main focus is on compilation with trapped-ion simulators using Pauli-$X$ operations and all-to-all Ising Hamiltonian $H_\text{Ising}$ evolution generated by Molmer-Sorensen or optical dipole force interactions, though some of our results also apply to standard gate-based compilations. Our results are based on principles of graph sparsification and decomposition; the former reduces the number of edges in a graph while maintaining its cut structure, while the latter breaks a weighted graph into a small number of unweighted graphs. Though these techniques have been used as heuristics in various hybrid quantum algorithms, there have been no guarantees on their performance, to the best of our knowledge. This work provides the first provable guarantees using sparsification and decomposition to improve quantum noise resilience and reduce quantum circuit complexity.
For quantum hardware that uses edge-by-edge QAOA compilations, sparsification leads to a direct reduction in circuit complexity. For trapped-ion quantum simulators implementing all-to-all $H_\text{Ising}$ pulses, we show that for a $(1-\epsilon)$ factor loss in the Max-Cut approximation ($\epsilon>0)$, our compilations improve the (worst-case) number of $H_\text{Ising}$ pulses from $O(n^2)$ to $O(n\log(n/\epsilon))$ and the (worst-case) number of Pauli-$X$ bit flips from $O(n^2)$ to $O\left(\frac{n\log(n/\epsilon)}{\epsilon^2}\right)$ for $n$-node graphs. We demonstrate significant reductions in noise are obtained in our new compilation approaches using theory and numerical calculations for trapped-ion hardware. We anticipate these approximate compilation techniques will be useful tools in a variety of future quantum computing experiments. - [899] arXiv:2406.14340 (cross-list from math.OC) [pdf, other]
-
Title: Learning rate adaptive stochastic gradient descent optimization methods: numerical simulations for deep learning methods for partial differential equations and convergence analysesComments: 68 pages, 8 figuresSubjects: Optimization and Control (math.OC); Machine Learning (cs.LG); Numerical Analysis (math.NA)
It is known that the standard stochastic gradient descent (SGD) optimization method, as well as accelerated and adaptive SGD optimization methods such as the Adam optimizer fail to converge if the learning rates do not converge to zero (as, for example, in the situation of constant learning rates). Numerical simulations often use human-tuned deterministic learning rate schedules or small constant learning rates. The default learning rate schedules for SGD optimization methods in machine learning implementation frameworks such as TensorFlow and Pytorch are constant learning rates. In this work we propose and study a learning-rate-adaptive approach for SGD optimization methods in which the learning rate is adjusted based on empirical estimates for the values of the objective function of the considered optimization problem (the function that one intends to minimize). In particular, we propose a learning-rate-adaptive variant of the Adam optimizer and implement it in case of several neural network learning problems, particularly, in the context of deep learning approximation methods for partial differential equations such as deep Kolmogorov methods, physics-informed neural networks, and deep Ritz methods. In each of the presented learning problems the proposed learning-rate-adaptive variant of the Adam optimizer faster reduces the value of the objective function than the Adam optimizer with the default learning rate. For a simple class of quadratic minimization problems we also rigorously prove that a learning-rate-adaptive variant of the SGD optimization method converges to the minimizer of the considered minimization problem. Our convergence proof is based on an analysis of the laws of invariant measures of the SGD method as well as on a more general convergence analysis for SGD with random but predictable learning rates which we develop in this work.
- [900] arXiv:2406.14347 (cross-list from physics.chem-ph) [pdf, other]
-
Title: $\nabla^2$DFT: A Universal Quantum Chemistry Dataset of Drug-Like Molecules and a Benchmark for Neural Network PotentialsKuzma Khrabrov, Anton Ber, Artem Tsypin, Konstantin Ushenin, Egor Rumiantsev, Alexander Telepov, Dmitry Protasov, Ilya Shenbin, Anton Alekseev, Mikhail Shirokikh, Sergey Nikolenko, Elena Tutubalina, Artur KadurinSubjects: Chemical Physics (physics.chem-ph); Machine Learning (cs.LG); Machine Learning (stat.ML)
Methods of computational quantum chemistry provide accurate approximations of molecular properties crucial for computer-aided drug discovery and other areas of chemical science. However, high computational complexity limits the scalability of their applications. Neural network potentials (NNPs) are a promising alternative to quantum chemistry methods, but they require large and diverse datasets for training. This work presents a new dataset and benchmark called $\nabla^2$DFT that is based on the nablaDFT. It contains twice as much molecular structures, three times more conformations, new data types and tasks, and state-of-the-art models. The dataset includes energies, forces, 17 molecular properties, Hamiltonian and overlap matrices, and a wavefunction object. All calculations were performed at the DFT level ($\omega$B97X-D/def2-SVP) for each conformation. Moreover, $\nabla^2$DFT is the first dataset that contains relaxation trajectories for a substantial number of drug-like molecules. We also introduce a novel benchmark for evaluating NNPs in molecular property prediction, Hamiltonian prediction, and conformational optimization tasks. Finally, we propose an extendable framework for training NNPs and implement 10 models within it.
- [901] arXiv:2406.14351 (cross-list from eess.IV) [pdf, html, other]
-
Title: Automatic Labels are as Effective as Manual Labels in Biomedical Images Classification with Deep LearningNiccolò Marini, Stefano Marchesin, Lluis Borras Ferris, Simon Püttmann, Marek Wodzinski, Riccardo Fratti, Damian Podareanu, Alessandro Caputo, Svetla Boytcheva, Simona Vatrano, Filippo Fraggetta, Iris Nagtegaal, Gianmaria Silvello, Manfredo Atzori, Henning MüllerComments: pre-print of the journal paperSubjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
The increasing availability of biomedical data is helping to design more robust deep learning (DL) algorithms to analyze biomedical samples. Currently, one of the main limitations to train DL algorithms to perform a specific task is the need for medical experts to label data. Automatic methods to label data exist, however automatic labels can be noisy and it is not completely clear when automatic labels can be adopted to train DL models. This paper aims to investigate under which circumstances automatic labels can be adopted to train a DL model on the classification of Whole Slide Images (WSI). The analysis involves multiple architectures, such as Convolutional Neural Networks (CNN) and Vision Transformer (ViT), and over 10000 WSIs, collected from three use cases: celiac disease, lung cancer and colon cancer, which one including respectively binary, multiclass and multilabel data. The results allow identifying 10% as the percentage of noisy labels that lead to train competitive models for the classification of WSIs. Therefore, an algorithm generating automatic labels needs to fit this criterion to be adopted. The application of the Semantic Knowledge Extractor Tool (SKET) algorithm to generate automatic labels leads to performance comparable to the one obtained with manual labels, since it generates a percentage of noisy labels between 2-5%. Automatic labels are as effective as manual ones, reaching solid performance comparable to the one obtained training models with manual labels.
- [902] arXiv:2406.14358 (cross-list from q-bio.NC) [pdf, other]
-
Title: The neural correlates of logical-mathematical symbol systems processing resemble that of spatial cognition more than natural language processingSubjects: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
The ability to manipulate logical-mathematical symbols (LMS), encompassing tasks such as calculation, reasoning, and programming, is a cognitive skill arguably unique to humans. Considering the relatively recent emergence of this ability in human evolutionary history, it has been suggested that LMS processing may build upon more fundamental cognitive systems, possibly through neuronal recycling. Previous studies have pinpointed two primary candidates, natural language processing and spatial cognition. Existing comparisons between these domains largely relied on task-level comparison, which may be confounded by task idiosyncrasy. The present study instead compared the neural correlates at the domain level with both automated meta-analysis and synthesized maps based on three representative LMS tasks, reasoning, calculation, and mental programming. Our results revealed a more substantial cortical overlap between LMS processing and spatial cognition, in contrast to language processing. Furthermore, in regions activated by both spatial and language processing, the multivariate activation pattern for LMS processing exhibited greater multivariate similarity to spatial cognition than to language processing. A hierarchical clustering analysis further indicated that typical LMS tasks were indistinguishable from spatial cognition tasks at the neural level, suggesting an inherent connection between these two cognitive processes. Taken together, our findings support the hypothesis that spatial cognition is likely the basis of LMS processing, which may shed light on the limitations of large language models in logical reasoning, particularly those trained exclusively on textual data without explicit emphasis on spatial content.
- [903] arXiv:2406.14376 (cross-list from math.PR) [pdf, html, other]
-
Title: Multicoloured Hardcore Model: Fast Mixing and QueueingComments: 12 pages. To appear at Analysis of Algorithms (AofA) 2024. Minor typo-correction and improvements to that versionSubjects: Probability (math.PR); Discrete Mathematics (cs.DM)
We extend the hardcore model to a multicoloured version: a subset of vertices of a graph are coloured such that no vertex is adjacent to one of the same colour; uncoloured vertices do not constrain neighbours. This mathematically models multi-channel resource sharing, such as fibreoptic routing.
We analyse certain simple Glauber-type dynamics on such configurations, and find conditions which ensure fast mixing. These dynamics model a queueing system: customers queue for service at vertices, who only serve customers whilst they are coloured in the underlying configuration; uncoloured vertices sit idle. The mixing estimates are applied to control queue lengths in equilibrium. - [904] arXiv:2406.14380 (cross-list from econ.EM) [pdf, html, other]
-
Title: Estimating Treatment Effects under Recommender Interference: A Structured Neural Networks ApproachSubjects: Econometrics (econ.EM); Machine Learning (cs.LG); Methodology (stat.ME)
Recommender systems are essential for content-sharing platforms by curating personalized content. To evaluate updates of recommender systems targeting content creators, platforms frequently engage in creator-side randomized experiments to estimate treatment effect, defined as the difference in outcomes when a new (vs. the status quo) algorithm is deployed on the platform. We show that the standard difference-in-means estimator can lead to a biased treatment effect estimate. This bias arises because of recommender interference, which occurs when treated and control creators compete for exposure through the recommender system. We propose a "recommender choice model" that captures how an item is chosen among a pool comprised of both treated and control content items. By combining a structural choice model with neural networks, the framework directly models the interference pathway in a microfounded way while accounting for rich viewer-content heterogeneity. Using the model, we construct a double/debiased estimator of the treatment effect that is consistent and asymptotically normal. We demonstrate its empirical performance with a field experiment on Weixin short-video platform: besides the standard creator-side experiment, we carry out a costly blocked double-sided randomization design to obtain a benchmark estimate without interference bias. We show that the proposed estimator significantly reduces the bias in treatment effect estimates compared to the standard difference-in-means estimator.
- [905] arXiv:2406.14426 (cross-list from stat.ML) [pdf, html, other]
-
Title: Transferable Boltzmann GeneratorsSubjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Chemical Physics (physics.chem-ph); Computational Physics (physics.comp-ph)
The generation of equilibrium samples of molecular systems has been a long-standing problem in statistical physics. Boltzmann Generators are a generative machine learning method that addresses this issue by learning a transformation via a normalizing flow from a simple prior distribution to the target Boltzmann distribution of interest. Recently, flow matching has been employed to train Boltzmann Generators for small molecular systems in Cartesian coordinates. We extend this work and propose a first framework for Boltzmann Generators that are transferable across chemical space, such that they predict zero-shot Boltzmann distributions for test molecules without being retrained for these systems. These transferable Boltzmann Generators allow approximate sampling from the target distribution of unseen systems, as well as efficient reweighting to the target Boltzmann distribution. The transferability of the proposed framework is evaluated on dipeptides, where we show that it generalizes efficiently to unseen systems. Furthermore, we demonstrate that our proposed architecture enhances the efficiency of Boltzmann Generators trained on single molecular systems.
- [906] arXiv:2406.14522 (cross-list from physics.soc-ph) [pdf, html, other]
-
Title: Learning thresholds lead to stable language coexistenceComments: 5 pages, 3 figuresSubjects: Physics and Society (physics.soc-ph); Computation and Language (cs.CL)
We introduce a language competition model that incorporates the effects of memory and learning on the language shift dynamics, using the Abrams-Strogatz model as a starting point. On a coarse grained time scale, the effects of memory and learning can be expressed as thresholds on the speakers fractions. In its simplest form, the resulting model is exactly solvable. Besides the consensus on one of the two languages, the model describes additional equilibrium states that are not present in the Abrams-Strogatz model: a stable coexistence of the two languages, if both thresholds are low enough, so that the language shift processes in the two opposite directions compensate each other, and a frozen state coinciding with the initial state, when both thresholds are too high for any language shift to take place. We show numerically that these results are preserved for threshold functions of a more general shape.
- [907] arXiv:2406.14534 (cross-list from eess.IV) [pdf, html, other]
-
Title: Epicardium Prompt-guided Real-time Cardiac Ultrasound Frame-to-volume RegistrationLong Lei, Jun Zhou, Jialun Pei, Baoliang Zhao, Yueming Jin, Yuen-Chun Jeremy Teoh, Jing Qin, Pheng-Ann HengComments: This paper has been accepted by MICCAI 2024Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
A comprehensive guidance view for cardiac interventional surgery can be provided by the real-time fusion of the intraoperative 2D images and preoperative 3D volume based on the ultrasound frame-to-volume registration. However, cardiac ultrasound images are characterized by a low signal-to-noise ratio and small differences between adjacent frames, coupled with significant dimension variations between 2D frames and 3D volumes to be registered, resulting in real-time and accurate cardiac ultrasound frame-to-volume registration being a very challenging task. This paper introduces a lightweight end-to-end Cardiac Ultrasound frame-to-volume Registration network, termed CU-Reg. Specifically, the proposed model leverages epicardium prompt-guided anatomical clues to reinforce the interaction of 2D sparse and 3D dense features, followed by a voxel-wise local-global aggregation of enhanced features, thereby boosting the cross-dimensional matching effectiveness of low-quality ultrasound modalities. We further embed an inter-frame discriminative regularization term within the hybrid supervised learning to increase the distinction between adjacent slices in the same ultrasound volume to ensure registration stability. Experimental results on the reprocessed CAMUS dataset demonstrate that our CU-Reg surpasses existing methods in terms of registration accuracy and efficiency, meeting the guidance requirements of clinical cardiac interventional surgery.
Cross submissions for Friday, 21 June 2024 (showing 113 of 113 entries )
- [908] arXiv:1808.05671 (replaced) [pdf, html, other]
-
Title: On the Convergence of Adaptive Gradient Methods for Nonconvex OptimizationComments: 25 pages, 2 tables. Published in Transactions on Machine Learning Research (TMLR)Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
Adaptive gradient methods are workhorses in deep learning. However, the convergence guarantees of adaptive gradient methods for nonconvex optimization have not been thoroughly studied. In this paper, we provide a fine-grained convergence analysis for a general class of adaptive gradient methods including AMSGrad, RMSProp and AdaGrad. For smooth nonconvex functions, we prove that adaptive gradient methods in expectation converge to a first-order stationary point. Our convergence rate is better than existing results for adaptive gradient methods in terms of dimension. In addition, we also prove high probability bounds on the convergence rates of AMSGrad, RMSProp as well as AdaGrad, which have not been established before. Our analyses shed light on better understanding the mechanism behind adaptive gradient methods in optimizing nonconvex objectives.
- [909] arXiv:2001.01258 (replaced) [pdf, html, other]
-
Title: The troublesome kernel -- On hallucinations, no free lunches and the accuracy-stability trade-off in inverse problemsSubjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Methods inspired by Artificial Intelligence (AI) are starting to fundamentally change computational science and engineering through breakthrough performances on challenging problems. However, reliability and trustworthiness of such techniques is a major concern. In inverse problems in imaging, the focus of this paper, there is increasing empirical evidence that methods may suffer from hallucinations, i.e., false, but realistic-looking artifacts; instability, i.e., sensitivity to perturbations in the data; and unpredictable generalization, i.e., excellent performance on some images, but significant deterioration on others. This paper provides a theoretical foundation for these phenomena. We give mathematical explanations for how and when such effects arise in arbitrary reconstruction methods, with several of our results taking the form of `no free lunch' theorems. Specifically, we show that (i) methods that overperform on a single image can wrongly transfer details from one image to another, creating a hallucination, (ii) methods that overperform on two or more images can hallucinate or be unstable, (iii) optimizing the accuracy-stability trade-off is generally difficult, (iv) hallucinations and instabilities, if they occur, are not rare events, and may be encouraged by standard training, (v) it may be impossible to construct optimal reconstruction maps for certain problems. Our results trace these effects to the kernel of the forward operator whenever it is nontrivial, but also apply to the case when the forward operator is ill-conditioned. Based on these insights, our work aims to spur research into new ways to develop robust and reliable AI-based methods for inverse problems in imaging.
- [910] arXiv:2008.11451 (replaced) [pdf, html, other]
-
Title: Determinantal Point Process as an alternative to NMSComments: Published in BMVC 2020Subjects: Computer Vision and Pattern Recognition (cs.CV)
We present a determinantal point process (DPP) inspired alternative to non-maximum suppression (NMS) which has become an integral step in all state-of-the-art object detection frameworks. DPPs have been shown to encourage diversity in subset selection problems. We pose NMS as a subset selection problem and posit that directly incorporating DPP like framework can improve the overall performance of the object detection system. We propose an optimization problem which takes the same inputs as NMS, but introduces a novel sub-modularity based diverse subset selection functional. Our results strongly indicate that the modifications proposed in this paper can provide consistent improvements to state-of-the-art object detection pipelines.
- [911] arXiv:2009.04544 (replaced) [pdf, html, other]
-
Title: Self-Adaptive Physics-Informed Neural Networks using a Soft Attention MechanismComments: 24 pages, 17 figures. Published in journal form as "Self-Adaptive Physics-Informed Neural Networks"Journal-ref: Journal of Computational Physics (2023), Vol. 474, 111722Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Physics-Informed Neural Networks (PINNs) have emerged recently as a promising application of deep neural networks to the numerical solution of nonlinear partial differential equations (PDEs). However, it has been recognized that adaptive procedures are needed to force the neural network to fit accurately the stubborn spots in the solution of "stiff" PDEs. In this paper, we propose a fundamentally new way to train PINNs adaptively, where the adaptation weights are fully trainable and applied to each training point individually, so the neural network learns autonomously which regions of the solution are difficult and is forced to focus on them. The self-adaptation weights specify a soft multiplicative soft attention mask, which is reminiscent of similar mechanisms used in computer vision. The basic idea behind these SA-PINNs is to make the weights increase as the corresponding losses increase, which is accomplished by training the network to simultaneously minimize the losses and maximize the weights. In addition, we show how to build a continuous map of self-adaptive weights using Gaussian Process regression, which allows the use of stochastic gradient descent in problems where conventional gradient descent is not enough to produce accurate solutions. Finally, we derive the Neural Tangent Kernel matrix for SA-PINNs and use it to obtain a heuristic understanding of the effect of the self-adaptive weights on the dynamics of training in the limiting case of infinitely-wide PINNs, which suggests that SA-PINNs work by producing a smooth equalization of the eigenvalues of the NTK matrix corresponding to the different loss terms. In numerical experiments with several linear and nonlinear benchmark problems, the SA-PINN outperformed other state-of-the-art PINN algorithm in L2 error, while using a smaller number of training epochs.
- [912] arXiv:2110.10927 (replaced) [pdf, html, other]
-
Title: SecureBoost+: Large Scale and High-Performance Vertical Federated Gradient Boosting Decision TreeSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Gradient boosting decision tree (GBDT) is an ensemble machine learning algorithm, which is widely used in industry, due to its good performance and easy interpretation. Due to the problem of data isolation and the requirement of privacy, many works try to use vertical federated learning to train machine learning models collaboratively with privacy guarantees between different data owners. SecureBoost is one of the most popular vertical federated learning algorithms for GBDT. However, in order to achieve privacy preservation, SecureBoost involves complex training procedures and time-consuming cryptography operations. This causes SecureBoost to be slow to train and does not scale to large scale data.
In this work, we propose SecureBoost+, a large-scale and high-performance vertical federated gradient boosting decision tree framework. SecureBoost+ is secure in the semi-honest model, which is the same as SecureBoost. SecureBoost+ can be scaled up to tens of millions of data samples easily. SecureBoost+ achieves high performance through several novel optimizations for SecureBoost, including ciphertext operation optimization, the introduction of new training mechanisms, and multi-classification training optimization. The experimental results show that SecureBoost+ is 6-35x faster than SecureBoost, but with the same accuracy and can be scaled up to tens of millions of data samples and thousands of feature dimensions. - [913] arXiv:2112.07436 (replaced) [pdf, html, other]
-
Title: Graph Kernel Neural NetworksLuca Cosmo, Giorgia Minello, Alessandro Bicciato, Michael Bronstein, Emanuele Rodolà, Luca Rossi, Andrea TorselloComments: IEEE Transactions on Neural Networks and Learning Systems (2024)Subjects: Machine Learning (cs.LG)
The convolution operator at the core of many modern neural architectures can effectively be seen as performing a dot product between an input matrix and a filter. While this is readily applicable to data such as images, which can be represented as regular grids in the Euclidean space, extending the convolution operator to work on graphs proves more challenging, due to their irregular structure. In this paper, we propose to use graph kernels, i.e. kernel functions that compute an inner product on graphs, to extend the standard convolution operator to the graph domain. This allows us to define an entirely structural model that does not require computing the embedding of the input graph. Our architecture allows to plug-in any type of graph kernels and has the added benefit of providing some interpretability in terms of the structural masks that are learned during the training process, similarly to what happens for convolutional masks in traditional convolutional neural networks. We perform an extensive ablation study to investigate the model hyper-parameters' impact and show that our model achieves competitive performance on standard graph classification and regression datasets.
- [914] arXiv:2202.11061 (replaced) [pdf, html, other]
-
Title: In This Apportionment Lottery, the House Always WinsComments: 33 pages, version as published by Operations ResearchSubjects: Computer Science and Game Theory (cs.GT)
Apportionment is the problem of distributing $h$ indivisible seats across states in proportion to the states' populations. In the context of the US House of Representatives, this problem has a rich history and is a prime example of interactions between mathematical analysis and political practice. Grimmett (2004) suggested to apportion seats in a randomized way such that each state receives exactly their proportional share $q_i$ of seats in expectation (ex ante proportionality) and receives either $\lfloor q_i \rfloor$ or $\lceil q_i \rceil$ many seats ex post (quota). However, there is a vast space of randomized apportionment methods satisfying these two axioms, and so we additionally consider prominent axioms from the apportionment literature. Our main result is a randomized method satisfying quota, ex ante proportionality and house monotonicity - a property that prevents paradoxes when the number of seats changes and which we require to hold ex post. This result is based on a generalization of dependent rounding on bipartite graphs, which we call cumulative rounding and which might be of independent interest, as we demonstrate via applications beyond apportionment.
- [915] arXiv:2204.06328 (replaced) [pdf, html, other]
-
Title: HuBERT-EE: Early Exiting HuBERT for Efficient Speech RecognitionComments: Accepted by INTERSPEECH 2024Subjects: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Pre-training with self-supervised models, such as Hidden-unit BERT (HuBERT) and wav2vec 2.0, has brought significant improvements in automatic speech recognition (ASR). However, these models usually require an expensive computational cost to achieve outstanding performance, slowing down the inference speed. To improve the model efficiency, we introduce an early exit scheme for ASR, namely HuBERT-EE, that allows the model to stop the inference dynamically. In HuBERT-EE, multiple early exit branches are added at the intermediate layers. When the intermediate prediction of the early exit branch is confident, the model stops the inference, and the corresponding result can be returned early. We investigate the proper early exiting criterion and fine-tuning strategy to effectively perform early exiting. Experimental results on the LibriSpeech show that HuBERT-EE can accelerate the inference of the HuBERT while simultaneously balancing the trade-off between the performance and the latency.
- [916] arXiv:2206.09642 (replaced) [pdf, html, other]
-
Title: Beyond IID: data-driven decision-making in heterogeneous environmentsSubjects: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
How should one leverage historical data when past observations are not perfectly indicative of the future, e.g., due to the presence of unobserved confounders which one cannot "correct" for? Motivated by this question, we study a data-driven decision-making framework in which historical samples are generated from unknown and different distributions assumed to lie in a heterogeneity ball with known radius and centered around the (also) unknown future (out-of-sample) distribution on which the performance of a decision will be evaluated. This work aims at analyzing the performance of central data-driven policies but also near-optimal ones in these heterogeneous environments and understanding key drivers of performance. We establish a first result which allows to upper bound the asymptotic worst-case regret of a broad class of policies. Leveraging this result, for any integral probability metric, we provide a general analysis of the performance achieved by Sample Average Approximation (SAA) as a function of the radius of the heterogeneity ball. This analysis is centered around the approximation parameter, a notion of complexity we introduce to capture how the interplay between the heterogeneity and the problem structure impacts the performance of SAA. In turn, we illustrate through several widely-studied problems -- e.g., newsvendor, pricing -- how this methodology can be applied and find that the performance of SAA varies considerably depending on the combinations of problem classes and heterogeneity. The failure of SAA for certain instances motivates the design of alternative policies to achieve rate-optimality. We derive problem-dependent policies achieving strong guarantees for the illustrative problems described above and provide initial results towards a principled approach for the design and analysis of general rate-optimal algorithms.
- [917] arXiv:2208.07540 (replaced) [pdf, html, other]
-
Title: Large-Scale Minimization of the Pseudospectral AbscissaComments: 31 pages, 5 figuresSubjects: Numerical Analysis (math.NA); Optimization and Control (math.OC)
This work concerns the minimization of the pseudospectral abscissa of a matrix-valued function dependent on parameters analytically. The problem is motivated by robust stability and transient behavior considerations for a linear control system that has optimization parameters. We describe a subspace procedure to cope with the setting when the matrix-valued function is of large size. The proposed subspace procedure solves a sequence of reduced problems obtained by restricting the matrix-valued function to small subspaces, whose dimensions increase gradually. It possesses desirable features such as a superlinear convergence exhibited by the decay in the errors of the minimizers of the reduced problems. In mathematical terms, the problem we consider is a large-scale nonconvex minimax eigenvalue optimization problem such that the eigenvalue function appears in the constraint of the inner maximization problem. Devising and analyzing a subspace framework for the minimax eigenvalue optimization problem at hand with the eigenvalue function in the constraint require special treatment that makes use of a Lagrangian and dual variables. There are notable advantages in minimizing the pseudospectral abscissa over maximizing the distance to instability or minimizing the $\mathcal{H}_\infty$ norm; the optimized pseudospectral abscissa provides quantitative information about the worst-case transient growth, and the initial guesses for the parameter values to optimize the pseudospectral abscissa can be arbitrary, unlike the case to optimize the distance to instability and $\mathcal{H}_\infty$ norm that would normally require initial guesses yielding asymptotically stable systems.
- [918] arXiv:2210.00036 (replaced) [pdf, html, other]
-
Title: Differentially Private Bias-Term Fine-tuning of Foundation ModelsComments: Accepted at ICML 2024Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL); Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
We study the problem of differentially private (DP) fine-tuning of large pre-trained models -- a recent privacy-preserving approach suitable for solving downstream tasks with sensitive data. Existing work has demonstrated that high accuracy is possible under strong privacy constraint, yet requires significant computational overhead or modifications to the network architecture. We propose differentially private bias-term fine-tuning (DP-BiTFiT), which matches the state-of-the-art accuracy for DP algorithms and the efficiency of the standard BiTFiT. DP-BiTFiT is model agnostic (not modifying the network architecture), parameter efficient (only training about 0.1% of the parameters), and computation efficient (almost removing the overhead caused by DP, in both the time and space complexity). On a wide range of tasks, DP-BiTFiT is 2~30X faster and uses 2~8X less memory than DP full fine-tuning, even faster than the standard full fine-tuning. This amazing efficiency enables us to conduct DP fine-tuning on language and vision tasks with long-sequence texts and high-resolution images, which were computationally difficult using existing methods. We open-source our code at FastDP (this https URL).
- [919] arXiv:2210.00898 (replaced) [pdf, html, other]
-
Title: Robust $Q$-learning Algorithm for Markov Decision Processes under Wasserstein UncertaintySubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Optimization and Control (math.OC); Probability (math.PR); Machine Learning (stat.ML)
We present a novel $Q$-learning algorithm tailored to solve distributionally robust Markov decision problems where the corresponding ambiguity set of transition probabilities for the underlying Markov decision process is a Wasserstein ball around a (possibly estimated) reference measure. We prove convergence of the presented algorithm and provide several examples also using real data to illustrate both the tractability of our algorithm as well as the benefits of considering distributional robustness when solving stochastic optimal control problems, in particular when the estimated distributions turn out to be misspecified in practice.
- [920] arXiv:2211.07547 (replaced) [pdf, html, other]
-
Title: On the Smoothed Complexity of Combinatorial Local SearchComments: An extended abstract of this paper was accepted to ICALP 2024Subjects: Computational Complexity (cs.CC); Computer Science and Game Theory (cs.GT)
We propose a unifying framework for smoothed analysis of combinatorial local optimization problems, and show how a diverse selection of problems within the complexity class PLS can be cast within this model. This abstraction allows us to identify key structural properties, and corresponding parameters, that determine the smoothed running time of local search dynamics. We formalize this via a black-box tool that provides concrete bounds on the expected maximum number of steps needed until local search reaches an exact local optimum. This bound is particularly strong, in the sense that it holds for any starting feasible solution, any choice of pivoting rule, and does not rely on the choice of specific noise distributions that are applied on the input, but it is parameterized by just a global upper bound $\phi$ on the probability density. The power of this tool can be demonstrated by instantiating it for various PLS-hard problems of interest to derive efficient smoothed running times (as a function of $\phi$ and the input size).
Most notably, we focus on the important local optimization problem of finding pure Nash equilibria in Congestion Games, that has not been studied before from a smoothed analysis perspective. Specifically, we propose novel smoothed analysis models for general and Network Congestion Games, under various representations, including explicit, step-function, and polynomial resource latencies. We study PLS-hard instances of these problems and show that their standard local search algorithms run in polynomial smoothed time.
Finally, we present further applications of our framework to a wide range of additional combinatorial problems, including local Max-Cut in weighted graphs, the Travelling Salesman problem (TSP) under the $k$-opt local heuristic, and finding pure equilibria in Network Coordination Games. - [921] arXiv:2211.14873 (replaced) [pdf, html, other]
-
Title: Which $L_p$ norm is the fairest? Approximations for fair facility location across all "$p$"Comments: A preliminary version of this article appeared as a one-page extended abstract in the proceedings of Economics and Computation (EC) 2023 conferenceSubjects: Data Structures and Algorithms (cs.DS)
Fair facility location problems try to balance access costs to open facilities borne by different groups of people by minimizing the $L_p$ norm of these group distances. However, there is no clear choice of "$p$" in the current literature. We present a novel approach to address the challenge of choosing the right notion of fairness. We introduce the concept of portfolios, a set of solutions that contains an approximately optimal solution for each objective in a given class of objectives, such as $L_p$ norms. This concept opens up new possibilities for getting around the "right" notion of fairness for many problems. For $r$ client groups, we demonstrate portfolios of size $\Theta(\log r)$ for the facility location and $k$-clustering problems, with an $O(1)$-approximate solution for each $L_p$ norm. Further, motivated by the Justice40 Initiative that provides rolling budget investments, we impose a refinement-like structure on the portfolio. We develop novel approximation algorithms for these structured portfolios and show experimental evidence of their performance in two US counties. We also present a planning tool that provides potential ways to expand access to US healthcare facilities, which might be of independent interest to policymakers.
- [922] arXiv:2212.01211 (replaced) [pdf, html, other]
-
Title: Sometimes Two Irrational Guards are NeededComments: 21 pages, 12 figuresSubjects: Computational Geometry (cs.CG)
In the art gallery problem, we are given a closed polygon $P$, with rational coordinates and an integer $k$. We are asked whether it is possible to find a set (of guards) $G$ of size $k$ such that any point $p\in P$ is seen by a point in $G$. We say two points $p$, $q$ see each other if the line segment $pq$ is contained inside $P$. It was shown by Abrahamsen, Adamaszek, and Miltzow that there is a polygon that can be guarded with three guards, but requires four guards if the guards are required to have rational coordinates. In other words, an optimal solution of size three might need to be irrational. We show that an optimal solution of size two might need to be irrational. Note that it is well-known that any polygon that can be guarded with one guard has an optimal guard placement with rational coordinates. Hence, our work closes the gap on when irrational guards are possible to occur.
- [923] arXiv:2212.10131 (replaced) [pdf, html, other]
-
Title: Hydra: Virtualized Multi-Language Runtime for High-Density Serverless PlatformsSubjects: Distributed, Parallel, and Cluster Computing (cs.DC); Programming Languages (cs.PL)
Serverless is an attractive computing model that offers seamless scalability and elasticity; it takes the infrastructure management burden away from users and enables a pay-as-you-use billing model. As a result, serverless is becoming increasingly popular to support highly elastic and bursty workloads. However, existing platforms are supported by bloated virtualization stacks which, combined with bursty and irregular invocations, leads to high memory and latency overheads.
To reduce the virtualization stack bloat, we propose Hydra, a virtualized multi-language serverless runtime capable of handling multiple invocations of functions written in different languages. To measure its impact in large platforms, we build a serverless platform that optimizes scheduling decisions to take advantage of Hydra by consolidating function invocations on a single instance, reducing the total infrastructure tax. Hydra improves the overall function density (ops/GB-sec) by 4.47$\times$ on average compared NodeJS, JVM, and CPython, the state-of-art single-language runtimes used in most serverless platforms. When reproducing the Azure Functions trace, Hydra reduces the overall memory footprint by 2.1 $\times$ and reduces the number of cold starts between 4 and 48 $\times$. - [924] arXiv:2212.13854 (replaced) [pdf, other]
-
Title: A DRL Approach for RIS-Assisted Full-Duplex UL and DL Transmission: Beamforming, Phase Shift and Power OptimizationSubjects: Information Theory (cs.IT); Signal Processing (eess.SP)
We propose a deep reinforcement learning (DRL) approach for a full-duplex (FD) transmission that predicts the phase shifts of the reconfigurable intelligent surface (RIS), base station (BS) active beamformers, and the transmit powers to maximize the weighted sum rate of uplink and downlink users. Existing methods require channel state information (CSI) and residual self-interference (SI) knowledge to calculate exact active beamformers or the DRL rewards, which typically fail without CSI or residual SI. Especially for time-varying channels, estimating and signaling CSI to the DRL agent is required at each time step and is costly. We propose a two-stage DRL framework with minimal signaling overhead to address this. The first stage uses the least squares method to initiate learning by partially canceling the residual SI. The second stage uses DRL to achieve performance comparable to existing CSI-based methods without requiring the CSI or the exact residual SI. Further, the proposed DRL framework for quantized RIS phase shifts reduces the signaling from BS to the RISs using $32$ times fewer bits than the continuous version. The quantized methods reduce action space, resulting in faster convergence and $7.1\%$ and $22.28\%$ better UL and DL rates, respectively than the continuous method.
- [925] arXiv:2301.07409 (replaced) [pdf, html, other]
-
Title: Representing Noisy Image Without DenoisingComments: Accepted by IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
A long-standing topic in artificial intelligence is the effective recognition of patterns from noisy images. In this regard, the recent data-driven paradigm considers 1) improving the representation robustness by adding noisy samples in training phase (i.e., data augmentation) or 2) pre-processing the noisy image by learning to solve the inverse problem (i.e., image denoising). However, such methods generally exhibit inefficient process and unstable result, limiting their practical applications. In this paper, we explore a non-learning paradigm that aims to derive robust representation directly from noisy images, without the denoising as pre-processing. Here, the noise-robust representation is designed as Fractional-order Moments in Radon space (FMR), with also beneficial properties of orthogonality and rotation invariance. Unlike earlier integer-order methods, our work is a more generic design taking such classical methods as special cases, and the introduced fractional-order parameter offers time-frequency analysis capability that is not available in classical methods. Formally, both implicit and explicit paths for constructing the FMR are discussed in detail. Extensive simulation experiments and an image security application are provided to demonstrate the uniqueness and usefulness of our FMR, especially for noise robustness, rotation invariance, and time-frequency discriminability.
- [926] arXiv:2301.10762 (replaced) [pdf, html, other]
-
Title: Optimisation of seismic imaging via bilevel learningSubjects: Numerical Analysis (math.NA)
Full Waveform Inversion (FWI) is a standard algorithm in seismic imaging. Its implementation requires the a priori choice of a number of "design parameters", such as the positions of sensors for the actual measurements and one (or more) regularisation weights. In this paper we describe a novel algorithm for determining these design parameters automatically from a set of training images, using a (supervised) bilevel learning approach. In our algorithm, the upper level objective function measures the quality of the reconstructions of the training images, where the reconstructions are obtained by solving the lower level optimisation problem -- in this case FWI. Our algorithm employs (variants of) the BFGS quasi-Newton method to perform the optimisation at each level, and thus requires the repeated solution of the forward problem -- here taken to be the Helmholtz equation.
This paper focuses on the implementation of the algorithm. The novel contributions are: (i) an adjoint-state method for the efficient computation of the upper-level gradient; (ii) a complexity analysis for the bilevel algorithm, which counts the number of Helmholtz solves needed and shows this number is independent of the number of design parameters optimised; (iii) an effective preconditioning strategy for iteratively solving the linear systems required at each step of the bilevel algorithm; (iv) a smoothed extraction process for point values of the discretised wavefield, necessary for ensuring a smooth upper level objective function. The algorithm also uses an extension to the bilevel setting of classical frequency-continuation strategies, helping avoid convergence to spurious stationary points. The advantage of our algorithm is demonstrated on a problem derived from the standard Marmousi test problem. - [927] arXiv:2301.12659 (replaced) [pdf, html, other]
-
Title: GPU Accelerated Newton for Taylor Series Solutions of Polynomial Homotopies in Multiple Double PrecisionComments: Accepted by CASC 2024, the 27th International Workshop on Computer Algebra in Scientific ComputingSubjects: Numerical Analysis (math.NA); Distributed, Parallel, and Cluster Computing (cs.DC); Mathematical Software (cs.MS); Algebraic Geometry (math.AG)
A polynomial homotopy is a family of polynomial systems, typically in one parameter $t$. Our problem is to compute power series expansions of the coordinates of the solutions in the parameter $t$, accurately, using multiple double arithmetic. One application of this problem is the location of the nearest singular solution in a polynomial homotopy, via the theorem of Fabry. Power series serve as input to construct Padé approximations.
Exploiting the massive parallelism of Graphics Processing Units capable of performing several trillions floating-point operations per second, the objective is to compensate for the cost overhead caused by arithmetic with power series in multiple double precision. The application of Newton's method for this problem requires the evaluation and differentiation of polynomials, followed by solving a blocked lower triangular linear system. Experimental results are obtained on NVIDIA GPUs, in particular the RTX 2080, RTX 4080, P100, V100, and A100.
Code generated by the CAMPARY software is used to obtain results in double double, quad double, and octo double precision. The programs in this study are self contained, available in a public github repository under the GPL-v3.0 License. - [928] arXiv:2301.12901 (replaced) [pdf, html, other]
-
Title: Simulating the Integration of Urban Air Mobility into Existing Transportation Systems: A SurveyXuan Jiang, Yuhan Tang, Junzhe Cao, Vishwanath Bulusu, Hao (Frank)Yang, Xin Peng, Yunhan Zheng, Jinhua Zhao, Raja SenguptaSubjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Urban air mobility (UAM) has the potential to revolutionize transportation in metropolitan areas, providing a new mode of transportation that could alleviate congestion and improve accessibility. However, the integration of UAM into existing transportation systems is a complex task that requires a thorough understanding of its impact on traffic flow and capacity. In this paper, we conduct a survey to investigate the current state of research on UAM in metropolitan-scale traffic using simulation techniques. We identify key challenges and opportunities for the integration of UAM into urban transportation systems, including impacts on existing traffic patterns and congestion; safety analysis and risk assessment; potential economic and environmental benefits; and the development of shared infrastructure and routes for UAM and ground-based transportation. We also discuss the potential benefits of UAM, such as reduced travel times and improved accessibility for underserved areas. Our survey provides a comprehensive overview of the current state of research on UAM in metropolitan-scale traffic using simulation and highlights key areas for future research and development.
- [929] arXiv:2301.13006 (replaced) [pdf, html, other]
-
Title: Fast Computation of Optimal Transport via Entropy-Regularized Extragradient MethodsSubjects: Machine Learning (cs.LG); Data Structures and Algorithms (cs.DS); Information Theory (cs.IT); Optimization and Control (math.OC); Machine Learning (stat.ML)
Efficient computation of the optimal transport distance between two distributions serves as an algorithm subroutine that empowers various applications. This paper develops a scalable first-order optimization-based method that computes optimal transport to within $\varepsilon$ additive accuracy with runtime $\widetilde{O}( n^2/\varepsilon)$, where $n$ denotes the dimension of the probability distributions of interest. Our algorithm achieves the state-of-the-art computational guarantees among all first-order methods, while exhibiting favorable numerical performance compared to classical algorithms like Sinkhorn and Greenkhorn. Underlying our algorithm designs are two key elements: (a) converting the original problem into a bilinear minimax problem over probability distributions; (b) exploiting the extragradient idea -- in conjunction with entropy regularization and adaptive learning rates -- to accelerate convergence.
- [930] arXiv:2302.02224 (replaced) [pdf, html, other]
-
Title: TAP: The Attention Patch for Cross-Modal Knowledge Transfer from Unlabeled ModalityComments: Accepted to TMLRSubjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
This paper addresses a cross-modal learning framework, where the objective is to enhance the performance of supervised learning in the primary modality using an unlabeled, unpaired secondary modality. Taking a probabilistic approach for missing information estimation, we show that the extra information contained in the secondary modality can be estimated via Nadaraya-Watson (NW) kernel regression, which can further be expressed as a kernelized cross-attention module (under linear transformation). This expression lays the foundation for introducing The Attention Patch (TAP), a simple neural network add-on that can be trained to allow data-level knowledge transfer from the unlabeled modality. We provide extensive numerical simulations using real-world datasets to show that TAP can provide statistically significant improvement in generalization across different domains and different neural network architectures, making use of seemingly unusable unlabeled cross-modal data.
- [931] arXiv:2302.03180 (replaced) [pdf, html, other]
-
Title: Understanding User Preferences in Explainable Artificial Intelligence: A Survey and a Mapping Function ProposalSubjects: Artificial Intelligence (cs.AI)
The increasing complexity of AI systems has led to the growth of the field of Explainable Artificial Intelligence (XAI), which aims to provide explanations and justifications for the outputs of AI algorithms. While there is considerable demand for XAI, there remains a scarcity of studies aimed at comprehensively understanding the practical distinctions among different methods and effectively aligning each method with users individual needs, and ideally, offer a mapping function which can map each user with its specific needs to a method of explainability. This study endeavors to bridge this gap by conducting a thorough review of extant research in XAI, with a specific focus on Explainable Machine Learning (XML), and a keen eye on user needs. Our main objective is to offer a classification of XAI methods within the realm of XML, categorizing current works into three distinct domains: philosophy, theory, and practice, and providing a critical review for each category. Moreover, our study seeks to facilitate the connection between XAI users and the most suitable methods for them and tailor explanations to meet their specific needs by proposing a mapping function that take to account users and their desired properties and suggest an XAI method to them. This entails an examination of prevalent XAI approaches and an evaluation of their properties. The primary outcome of this study is the formulation of a clear and concise strategy for selecting the optimal XAI method to achieve a given goal, all while delivering personalized explanations tailored to individual users.
- [932] arXiv:2303.01934 (replaced) [pdf, html, other]
-
Title: CONTAIN: A Community-based Algorithm for Network ImmunizationJournal-ref: E.S. Apostol, \"{O}. Coban, C.O. Truic\u{a}. CONTAIN: A community-based algorithm for network immunization. Engineering Science and Technology, an International Journal. 55:1-10(101728), ISSN 2215-0986, July 2024Subjects: Social and Information Networks (cs.SI); Artificial Intelligence (cs.AI)
Network immunization is an automated task in the field of network analysis that involves protecting a network (modeled as a graph) from being infected by an undesired arbitrary diffusion. In this article, we consider the spread of harmful content in social networks, and we propose CONTAIN, a novel COmmuNiTy-based Algorithm for network ImmuNization. Our solution uses the network information to (1) detect harmful content spreaders, and (2) generate partitions and rank them for immunization using the subgraphs induced by each spreader, i.e., employing CONTAIN. The experimental results obtained on real-world datasets show that CONTAIN outperforms state-of-the-art solutions, i.e., NetShield and SparseShield, by immunizing the network in fewer iterations, thus, converging significantly faster than the state-of-the-art algorithms. We also compared our solution in terms of scalability with the state-of-the-art tree-based mitigation algorithm MCWDST, as well as with NetShield and SparseShield. We can conclude that our solution outperforms MCWDST and NetShield.
- [933] arXiv:2303.09435 (replaced) [pdf, html, other]
-
Title: Jump to Conclusions: Short-Cutting Transformers With Linear TransformationsJournal-ref: LREC-COLING 2024Subjects: Computation and Language (cs.CL)
Transformer-based language models create hidden representations of their inputs at every layer, but only use final-layer representations for prediction. This obscures the internal decision-making process of the model and the utility of its intermediate representations. One way to elucidate this is to cast the hidden representations as final representations, bypassing the transformer computation in-between. In this work, we suggest a simple method for such casting, using linear transformations. This approximation far exceeds the prevailing practice of inspecting hidden representations from all layers, in the space of the final layer. Moreover, in the context of language modeling, our method produces more accurate predictions from hidden layers, across various model scales, architectures, and data distributions. This allows "peeking" into intermediate representations, showing that GPT-2 and BERT often predict the final output already in early layers. We then demonstrate the practicality of our method to recent early exit strategies, showing that when aiming, for example, at retention of 95% accuracy, our approach saves additional 7.9% layers for GPT-2 and 5.4% layers for BERT. Last, we extend our method to linearly approximate sub-modules, finding that attention is most tolerant to this change. Our code and learned mappings are publicly available at this https URL.
- [934] arXiv:2303.15350 (replaced) [pdf, html, other]
-
Title: Improving Neural Topic Models with Wasserstein Knowledge DistillationComments: Accepted at ECIR 2023Subjects: Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Topic modeling is a dominant method for exploring document collections on the web and in digital libraries. Recent approaches to topic modeling use pretrained contextualized language models and variational autoencoders. However, large neural topic models have a considerable memory footprint. In this paper, we propose a knowledge distillation framework to compress a contextualized topic model without loss in topic quality. In particular, the proposed distillation objective is to minimize the cross-entropy of the soft labels produced by the teacher and the student models, as well as to minimize the squared 2-Wasserstein distance between the latent distributions learned by the two models. Experiments on two publicly available datasets show that the student trained with knowledge distillation achieves topic coherence much higher than that of the original student model, and even surpasses the teacher while containing far fewer parameters than the teacher's. The distilled model also outperforms several other competitive topic models on topic coherence.
- [935] arXiv:2303.17707 (replaced) [pdf, html, other]
-
Title: Why is plausibility surprisingly problematic as an XAI criterion?Subjects: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Explainable artificial intelligence (XAI) is motivated by the problem of making AI predictions understandable, transparent, and responsible, as AI becomes increasingly impactful in society and high-stakes domains. XAI algorithms are designed to explain AI decisions in human-understandable ways. The evaluation and optimization criteria of XAI are gatekeepers for XAI algorithms to achieve their expected goals and should withstand rigorous inspection. To improve the scientific rigor of XAI, we conduct the first critical examination of a common XAI criterion: plausibility. It measures how convincing the AI explanation is to humans, and is usually quantified by metrics on feature localization or correlation of feature attribution. Our examination shows, although plausible explanations can improve users' understanding and local trust in an AI decision, doing so is at the cost of abandoning other possible approaches of enhancing understandability, increasing misleading explanations that manipulate users, being unable to achieve complementary human-AI task performance, and deteriorating users' global trust in the overall AI system. Because the flaws outweigh the benefits, we do not recommend using plausibility as a criterion to evaluate or optimize XAI algorithms. We also identify new directions to improve XAI on understandability and utility to users including complementary human-AI task performance.
- [936] arXiv:2303.17890 (replaced) [pdf, html, other]
-
Title: Fooling Polarization-based Vision using Locally Controllable Polarizing ProjectionSubjects: Computer Vision and Pattern Recognition (cs.CV)
Polarization is a fundamental property of light that encodes abundant information regarding surface shape, material, illumination and viewing geometry. The computer vision community has witnessed a blossom of polarization-based vision applications, such as reflection removal, shape-from-polarization, transparent object segmentation and color constancy, partially due to the emergence of single-chip mono/color polarization sensors that make polarization data acquisition easier than ever. However, is polarization-based vision vulnerable to adversarial attacks? If so, is that possible to realize these adversarial attacks in the physical world, without being perceived by human eyes? In this paper, we warn the community of the vulnerability of polarization-based vision, which can be more serious than RGB-based vision. By adapting a commercial LCD projector, we achieve locally controllable polarizing projection, which is successfully utilized to fool state-of-the-art polarization-based vision algorithms for glass segmentation and color constancy. Compared with existing physical attacks on RGB-based vision, which always suffer from the trade-off between attack efficacy and eye conceivability, the adversarial attackers based on polarizing projection are contact-free and visually imperceptible, since naked human eyes can rarely perceive the difference of viciously manipulated polarizing light and ordinary illumination. This poses unprecedented risks on polarization-based vision, both in the monochromatic and trichromatic domain, for which due attentions should be paid and counter measures be considered.
- [937] arXiv:2304.08279 (replaced) [pdf, html, other]
-
Title: MoDA: Modeling Deformable 3D Objects from Casual VideosSubjects: Computer Vision and Pattern Recognition (cs.CV)
In this paper, we focus on the challenges of modeling deformable 3D objects from casual videos. With the popularity of neural radiance fields (NeRF), many works extend it to dynamic scenes with a canonical NeRF and a deformation model that achieves 3D point transformation between the observation space and the canonical space. Recent works rely on linear blend skinning (LBS) to achieve the canonical-observation transformation. However, the linearly weighted combination of rigid transformation matrices is not guaranteed to be rigid. As a matter of fact, unexpected scale and shear factors often appear. In practice, using LBS as the deformation model can always lead to skin-collapsing artifacts for bending or twisting motions. To solve this problem, we propose neural dual quaternion blend skinning (NeuDBS) to achieve 3D point deformation, which can perform rigid transformation without skin-collapsing artifacts. In the endeavor to register 2D pixels across different frames, we establish a correspondence between canonical feature embeddings that encodes 3D points within the canonical space, and 2D image features by solving an optimal transport problem. Besides, we introduce a texture filtering approach for texture rendering that effectively minimizes the impact of noisy colors outside target deformable objects. Extensive experiments on real and synthetic datasets show that our approach can reconstruct 3D models for humans and animals with better qualitative and quantitative performance than state-of-the-art methods. Project page: \url{this https URL}.
- [938] arXiv:2305.01320 (replaced) [pdf, other]
-
Title: Higher-Order Generalized Finite Differences for Variable Coefficient Diffusion OperatorsSubjects: Numerical Analysis (math.NA)
We present a novel approach of discretizing variable coefficient diffusion operators in the context of meshfree generalized finite difference methods. Our ansatz uses properties of derived operators and combines the discrete Laplace operator with reconstruction functions approximating the diffusion coefficient. Provided that the reconstructions are of a sufficiently high order, we prove that the order of accuracy of the discrete Laplace operator transfers to the derived diffusion operator. We show that the new discrete diffusion operator inherits the diagonal dominance property of the discrete Laplace operator. Finally, we present the possibility of discretizing anisotropic diffusion operators with the help of derived operators. Our numerical results for Poisson's equation and the heat equation show that even low-order reconstructions preserve the order of the underlying discrete Laplace operator for sufficiently smooth diffusion coefficients. In experiments, we demonstrate the applicability of the new discrete diffusion operator to interface problems with point clouds not aligning to the interface and numerically show first-order convergence.
- [939] arXiv:2305.03901 (replaced) [pdf, html, other]
-
Title: Synthesizing PET images from High-field and Ultra-high-field MR images Using Joint Diffusion Attention ModelTaofeng Xie, Chentao Cao, Zhuoxu Cui, Yu Guo, Caiying Wu, Xuemei Wang, Qingneng Li, Zhanli Hu, Tao Sun, Ziru Sang, Yihang Zhou, Yanjie Zhu, Dong Liang, Qiyu Jin, Hongwu Zeng, Guoqing Chen, Haifeng WangSubjects: Machine Learning (cs.LG)
MRI and PET are crucial diagnostic tools for brain diseases, as they provide complementary information on brain structure and function. However, PET scanning is costly and involves radioactive exposure, resulting in a lack of PET. Moreover, simultaneous PET and MRI at ultra-high-field are currently hardly infeasible. Ultra-high-field imaging has unquestionably proven valuable in both clinical and academic settings, especially in the field of cognitive neuroimaging. These motivate us to propose a method for synthetic PET from high-filed MRI and ultra-high-field MRI. From a statistical perspective, the joint probability distribution (JPD) is the most direct and fundamental means of portraying the correlation between PET and MRI. This paper proposes a novel joint diffusion attention model which has the joint probability distribution and attention strategy, named JDAM. JDAM has a diffusion process and a sampling process. The diffusion process involves the gradual diffusion of PET to Gaussian noise by adding Gaussian noise, while MRI remains fixed. JPD of MRI and noise-added PET was learned in the diffusion process. The sampling process is a predictor-corrector. PET images were generated from MRI by JPD of MRI and noise-added PET. The predictor is a reverse diffusion process and the corrector is Langevin dynamics. Experimental results on the public Alzheimer's Disease Neuroimaging Initiative (ADNI) dataset demonstrate that the proposed method outperforms state-of-the-art CycleGAN for high-field MRI (3T MRI). Finally, synthetic PET images from the ultra-high-field (5T MRI and 7T MRI) be attempted, providing a possibility for ultra-high-field PET-MRI imaging.
- [940] arXiv:2305.04694 (replaced) [pdf, html, other]
-
Title: Evaluating Impact of User-Cluster Targeted Attacks in Matrix Factorisation RecommendersSubjects: Information Retrieval (cs.IR)
In practice, users of a Recommender System (RS) fall into a few clusters based on their preferences. In this work, we conduct a systematic study on user-cluster targeted data poisoning attacks on Matrix Factorisation (MF) based RS, where an adversary injects fake users with falsely crafted user-item feedback to promote an item to a specific user cluster. We analyse how user and item feature matrices change after data poisoning attacks and identify the factors that influence the effectiveness of the attack on these feature matrices. We demonstrate that the adversary can easily target specific user clusters with minimal effort and that some items are more susceptible to attacks than others. Our theoretical analysis has been validated by the experimental results obtained from two real-world datasets. Our observations from the study could serve as a motivating point to design a more robust RS.
- [941] arXiv:2305.05738 (replaced) [pdf, html, other]
-
Title: DOCTOR: A Multi-Disease Detection Continual Learning Framework Based on Wearable Medical SensorsComments: 39 pages, 14 figures. This work has been submitted to the ACM for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessibleSubjects: Machine Learning (cs.LG); Human-Computer Interaction (cs.HC); Signal Processing (eess.SP)
Modern advances in machine learning (ML) and wearable medical sensors (WMSs) in edge devices have enabled ML-driven disease detection for smart healthcare. Conventional ML-driven methods for disease detection rely on customizing individual models for each disease and its corresponding WMS data. However, such methods lack adaptability to distribution shifts and new task classification classes. In addition, they need to be rearchitected and retrained from scratch for each new disease. Moreover, installing multiple ML models in an edge device consumes excessive memory, drains the battery faster, and complicates the detection process. To address these challenges, we propose DOCTOR, a multi-disease detection continual learning (CL) framework based on WMSs. It employs a multi-headed deep neural network (DNN) and a replay-style CL algorithm. The CL algorithm enables the framework to continually learn new missions where different data distributions, classification classes, and disease detection tasks are introduced sequentially. It counteracts catastrophic forgetting with a data preservation method and a synthetic data generation (SDG) module. The data preservation method preserves the most informative subset of real training data from previous missions for exemplar replay. The SDG module models the probability distribution of the real training data and generates synthetic data for generative replay while retaining data privacy. The multi-headed DNN enables DOCTOR to detect multiple diseases simultaneously based on user WMS data. We demonstrate DOCTOR's efficacy in maintaining high disease classification accuracy with a single DNN model in various CL experiments. In complex scenarios, DOCTOR achieves 1.43 times better average test accuracy, 1.25 times better F1-score, and 0.41 higher backward transfer than the naive fine-tuning framework with a small model size of less than 350KB.
- [942] arXiv:2305.13582 (replaced) [pdf, html, other]
-
Title: Translation and Fusion Improves Zero-shot Cross-lingual Information ExtractionSubjects: Computation and Language (cs.CL)
Large language models (LLMs) combined with instruction tuning have shown significant progress in information extraction (IE) tasks, exhibiting strong generalization capabilities to unseen datasets by following annotation guidelines. However, their applicability to low-resource languages remains limited due to lack of both labeled data for fine-tuning, and unlabeled text for pre-training. In this paper, we propose TransFusion, a framework in which models are fine-tuned to use English translations of low-resource language data, enabling more precise predictions through annotation fusion. Based on TransFusion, we introduce GoLLIE-TF, a cross-lingual instruction-tuned LLM for IE tasks, designed to close the performance gap between high and low-resource languages. Our experiments across twelve multilingual IE datasets spanning 50 languages demonstrate that GoLLIE-TF achieves better zero-shot cross-lingual transfer over the base model. In addition, we show that TransFusion significantly improves low-resource language named entity recognition when applied to proprietary models such as GPT-4 (+5 F1) with a prompting approach, or fine-tuning different language models including decoder-only (+14 F1) and encoder-only (+13 F1) architectures.
- [943] arXiv:2305.14736 (replaced) [pdf, html, other]
-
Title: Optimal Control of Logically Constrained Partially Observable and Multi-Agent Markov Decision ProcessesComments: arXiv admin note: substantial text overlap with arXiv:2203.09038Subjects: Artificial Intelligence (cs.AI); Formal Languages and Automata Theory (cs.FL); Systems and Control (eess.SY)
Autonomous systems often have logical constraints arising, for example, from safety, operational, or regulatory requirements. Such constraints can be expressed using temporal logic specifications. The system state is often partially observable. Moreover, it could encompass a team of multiple agents with a common objective but disparate information structures and constraints. In this paper, we first introduce an optimal control theory for partially observable Markov decision processes (POMDPs) with finite linear temporal logic constraints. We provide a structured methodology for synthesizing policies that maximize a cumulative reward while ensuring that the probability of satisfying a temporal logic constraint is sufficiently high. Our approach comes with guarantees on approximate reward optimality and constraint satisfaction. We then build on this approach to design an optimal control framework for logically constrained multi-agent settings with information asymmetry. We illustrate the effectiveness of our approach by implementing it on several case studies.
- [944] arXiv:2305.18353 (replaced) [pdf, html, other]
-
Title: Emergent representations in networks trained with the Forward-Forward algorithmNiccolò Tosato, Lorenzo Basile, Emanuele Ballarin, Giuseppe de Alteriis, Alberto Cazzaniga, Alessio AnsuiniSubjects: Neural and Evolutionary Computing (cs.NE); Machine Learning (cs.LG)
The Backpropagation algorithm has often been criticised for its lack of biological realism. In an attempt to find a more biologically plausible alternative, the recently introduced Forward-Forward algorithm replaces the forward and backward passes of Backpropagation with two forward passes. In this work, we show that the internal representations obtained by the Forward-Forward algorithm can organise into category-specific ensembles exhibiting high sparsity - composed of a low number of active units. This situation is reminiscent of what has been observed in cortical sensory areas, where neuronal ensembles are suggested to serve as the functional building blocks for perception and action. Interestingly, while this sparse pattern does not typically arise in models trained with standard Backpropagation, it can emerge in networks trained with Backpropagation on the same objective proposed for the Forward-Forward algorithm. These results suggest that the learning procedure proposed by Forward-Forward may be superior to Backpropagation in modelling learning in the cortex, even when a backward pass is used.
- [945] arXiv:2305.18403 (replaced) [pdf, html, other]
-
Title: LoRAPrune: Pruning Meets Low-Rank Parameter-Efficient Fine-TuningComments: accepted by acl 2024 findingsSubjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Large Language Models (LLMs), such as LLaMA and T5, have shown exceptional performance across various tasks through fine-tuning. Although low-rank adaption (LoRA) has emerged to cheaply fine-tune these LLMs on downstream tasks, their deployment is still hindered by the vast model scale and computational costs. Post-training model pruning offers a way to compress LLMs. However, the current pruning methods designed for LLMs are not compatible with LoRA. This is due to their utilization of unstructured pruning on LLMs, impeding the merging of LoRA weights, or their dependence on the gradients of pre-trained weights to guide pruning, which can impose significant memory overhead. To this end, we propose LoRAPrune, a new framework that delivers an accurate structured pruned model in a highly memory-efficient manner. Specifically, we first design a LoRA-guided pruning criterion, which uses the weights and gradients of LoRA, rather than the gradients of pre-trained weights for importance estimation. We subsequently integrate this criterion into an iterative pruning process, effectively removing redundant channels and heads. Extensive experimental results demonstrate the superior performance of our LoRAPrune over existing approaches on the LLaMA series models. At a 50\% compression rate, LoRAPrune demonstrates superior performance over LLM-Pruner, achieving a reduction in perplexity by 4.81 on WikiText2 and 3.46 on PTB, while also decreasing memory usage by 52.6%. Besides, LoRAPrune also matches semi-structural pruning across multiple LLMs, proving its wide applicability. The code is available at this https URL.
- [946] arXiv:2305.19035 (replaced) [pdf, html, other]
-
Title: Solving Robust MDPs through No-Regret DynamicsComments: Transactions of Machine Learning ResearchSubjects: Machine Learning (cs.LG)
Reinforcement Learning is a powerful framework for training agents to navigate different situations, but it is susceptible to changes in environmental dynamics. However, solving Markov Decision Processes that are robust to changes is difficult due to nonconvexity and size of action or state spaces. While most works have analyzed this problem by taking different assumptions on the problem, a general and efficient theoretical analysis is still missing. However, we generate a simple framework for improving robustness by solving a minimax iterative optimization problem where a policy player and an environmental dynamics player are playing against each other. Leveraging recent results in online nonconvex learning and techniques from improving policy gradient methods, we yield an algorithm that maximizes the robustness of the Value Function on the order of $\mathcal{O}\left(\frac{1}{T^{\frac{1}{2}}}\right)$ where $T$ is the number of iterations of the algorithm.
- [947] arXiv:2305.19118 (replaced) [pdf, html, other]
-
Title: Encouraging Divergent Thinking in Large Language Models through Multi-Agent DebateTian Liang, Zhiwei He, Wenxiang Jiao, Xing Wang, Yan Wang, Rui Wang, Yujiu Yang, Zhaopeng Tu, Shuming ShiComments: Work in progressSubjects: Computation and Language (cs.CL)
Modern large language models (LLMs) like ChatGPT have shown remarkable performance on general language tasks but still struggle on complex reasoning tasks, which drives the research on cognitive behaviors of LLMs to explore human-like problem-solving strategies. Along this direction, one representative strategy is self-reflection, which asks an LLM to refine the solution with the feedback generated by itself iteratively. However, our study shows that such reflection-style methods suffer from the Degeneration-of-Thought (DoT) problem: once the LLM has established confidence in its solutions, it is unable to generate novel thoughts later through reflection even if its initial stance is incorrect. To address the DoT problem, we propose a Multi-Agent Debate (MAD) framework, in which multiple agents express their arguments in the state of "tit for tat" and a judge manages the debate process to obtain a final solution. Clearly, our MAD framework encourages divergent thinking in LLMs which would be helpful for tasks that require deep levels of contemplation. Experiment results on two challenging datasets, commonsense machine translation and counter-intuitive arithmetic reasoning, demonstrate the effectiveness of our MAD framework. Extensive analyses suggest that the adaptive break of debate and the modest level of "tit for tat" state are required for MAD to obtain good performance. Moreover, we find that LLMs might not be a fair judge if different LLMs are used for agents. Code is available at this https URL.
- [948] arXiv:2306.05486 (replaced) [pdf, html, other]
-
Title: Multilevel domain decomposition-based architectures for physics-informed neural networksComments: 42 pages, 12 figuresSubjects: Numerical Analysis (math.NA)
Physics-informed neural networks (PINNs) are a powerful approach for solving problems involving differential equations, yet they often struggle to solve problems with high frequency and/or multi-scale solutions. Finite basis physics-informed neural networks (FBPINNs) improve the performance of PINNs in this regime by combining them with an overlapping domain decomposition approach. In this work, FBPINNs are extended by adding multiple levels of domain decompositions to their solution ansatz, inspired by classical multilevel Schwarz domain decomposition methods (DDMs). Analogous to typical tests for classical DDMs, we assess how the accuracy of PINNs, FBPINNs and multilevel FBPINNs scale with respect to computational effort and solution complexity by carrying out strong and weak scaling tests. Our numerical results show that the proposed multilevel FBPINNs consistently and significantly outperform PINNs across a range of problems with high frequency and multi-scale solutions. Furthermore, as expected in classical DDMs, we show that multilevel FBPINNs improve the accuracy of FBPINNs when using large numbers of subdomains by aiding global communication between subdomains.
- [949] arXiv:2306.06247 (replaced) [pdf, html, other]
-
Title: Online Learning with Set-Valued FeedbackComments: Accepted to COLT 2024Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
We study a variant of online multiclass classification where the learner predicts a single label but receives a \textit{set of labels} as feedback. In this model, the learner is penalized for not outputting a label contained in the revealed set. We show that unlike online multiclass learning with single-label feedback, deterministic and randomized online learnability are \textit{not equivalent} even in the realizable setting with set-valued feedback. Accordingly, we give two new combinatorial dimensions, named the Set Littlestone and Measure Shattering dimension, that tightly characterize deterministic and randomized online learnability respectively in the realizable setting. In addition, we show that the Measure Shattering dimension characterizes online learnability in the agnostic setting and tightly quantifies the minimax regret. Finally, we use our results to establish bounds on the minimax regret for three practical learning settings: online multilabel ranking, online multilabel classification, and real-valued prediction with interval-valued response.
- [950] arXiv:2306.06862 (replaced) [pdf, other]
-
Title: Saltation Matrices: The Essential Tool for Linearizing Hybrid Dynamical SystemsSubjects: Robotics (cs.RO)
Hybrid dynamical systems, i.e. systems that have both continuous and discrete states, are ubiquitous in engineering, but are difficult to work with due to their discontinuous transitions. For example, a robot leg is able to exert very little control effort while it is in the air compared to when it is on the ground. When the leg hits the ground, the penetrating velocity instantaneously collapses to zero. These instantaneous changes in dynamics and discontinuities (or jumps) in state make standard smooth tools for planning, estimation, control, and learning difficult for hybrid systems. One of the key tools for accounting for these jumps is called the saltation matrix. The saltation matrix is the sensitivity update when a hybrid jump occurs and has been used in a variety of fields including robotics, power circuits, and computational neuroscience. This paper presents an intuitive derivation of the saltation matrix and discusses what it captures, where it has been used in the past, how it is used for linear and quadratic forms, how it is computed for rigid body systems with unilateral constraints, and some of the structural properties of the saltation matrix in these cases.
- [951] arXiv:2306.08529 (replaced) [pdf, other]
-
Title: SQL2Circuits: Estimating Metrics for SQL Queries with a Quantum Natural Language Processing MethodComments: 15 pages, 13 figures, 2 tablesSubjects: Databases (cs.DB); Machine Learning (cs.LG); Quantum Physics (quant-ph)
In recent years, advances in quantum computing have led to accelerating research on quantum applications across fields. Here, we introduce a quantum machine learning model as a potential solution to the classical question in database research: the estimation of metrics for SQL queries. This work employs a quantum natural language processing (QNLP)-inspired approach for constructing a quantum machine learning model that can classify SQL queries with respect to their cardinalities, costs, and execution times. The model consists of an encoding mechanism and a training phase, including classical and quantum subroutines. The encoding mechanism encodes SQL queries as parametrized quantum circuits. In the training phase, we utilize classical optimization algorithms, such as SPSA and Adam, to optimize the circuit parameters to make predictions about the query metrics. We conclude that our model reaches an accuracy equivalent to that of the QNLP model in the binary classification tasks. Moreover, we extend the previous work by adding 4-class classification tasks and compare the cardinality estimation results to the state-of-the-art databases. We perform a theoretical analysis of the quantum machine learning model by calculating its expressibility and entangling capabilities. The analysis shows that the model has advantageous properties that make it expressible but also not too complex to be executed on the existing quantum hardware.
- [952] arXiv:2306.09293 (replaced) [pdf, html, other]
-
Title: [Experiments & Analysis] Evaluating the Feasibility of Sampling-Based Techniques for Training Multilayer PerceptronsSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
The training process of neural networks is known to be time-consuming, and having a deep architecture only aggravates the issue. This process consists mostly of matrix operations, among which matrix multiplication is the bottleneck. Several sampling-based techniques have been proposed for speeding up the training time of deep neural networks by approximating the matrix products. These techniques fall under two categories: (i) sampling a subset of nodes in every hidden layer as active at every iteration and (ii) sampling a subset of nodes from the previous layer to approximate the current layer's activations using the edges from the sampled nodes. In both cases, the matrix products are computed using only the selected samples. In this paper, we evaluate the feasibility of these approaches on CPU machines with limited computational resources. Making a connection between the two research directions as special cases of approximating matrix multiplications in the context of neural networks, we provide a negative theoretical analysis that shows feedforward approximation is an obstacle against scalability. We conduct comprehensive experimental evaluations that demonstrate the most pressing challenges and limitations associated with the studied approaches. We observe that the hashing-based node selection method is not scalable to a large number of layers, confirming our theoretical analysis. Finally, we identify directions for future research.
- [953] arXiv:2306.09774 (replaced) [pdf, html, other]
-
Title: Vessim: A Testbed for Carbon-Aware Applications and SystemsComments: HotCarbon'24Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Systems and Control (eess.SY)
To reduce the carbon footprint of computing and stabilize electricity grids, there is an increasing focus on approaches that align the power usage of IT infrastructure with the availability of clean energy. Unfortunately, research on energy-aware and carbon-aware applications, as well as the interfaces between computing and energy systems, remains complex due to the scarcity of available testing environments. To this day, almost all new approaches are evaluated on custom simulation testbeds, which leads to repeated development efforts and limited comparability of results.
In this paper, we present Vessim, a co-simulation environment for testing applications and computing systems that interact with their energy systems. Our testbed connects domain-specific simulators for renewable power generation and energy storage, and enables users to implement interfaces to integrate real systems through software and hardware-in-the-loop simulation. Vessim offers an easy-to-use interface, is extendable to new simulators, and provides direct access to historical datasets. We aim to not only accelerate research in carbon-aware computing but also facilitate development and operation, as in continuous testing or digital twins. Vessim is publicly available: this https URL. - [954] arXiv:2307.01646 (replaced) [pdf, html, other]
-
Title: SwinGNN: Rethinking Permutation Invariance in Diffusion Models for Graph GenerationComments: TMLR 2024Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Diffusion models based on permutation-equivariant networks can learn permutation-invariant distributions for graph data. However, in comparison to their non-invariant counterparts, we have found that these invariant models encounter greater learning challenges since 1) their effective target distributions exhibit more modes; 2) their optimal one-step denoising scores are the score functions of Gaussian mixtures with more components. Motivated by this analysis, we propose a non-invariant diffusion model, called $\textit{SwinGNN}$, which employs an efficient edge-to-edge 2-WL message passing network and utilizes shifted window based self-attention inspired by SwinTransformers. Further, through systematic ablations, we identify several critical training and sampling techniques that significantly improve the sample quality of graph generation. At last, we introduce a simple post-processing trick, $\textit{i.e.}$, randomly permuting the generated graphs, which provably converts any graph generative model to a permutation-invariant one. Extensive experiments on synthetic and real-world protein and molecule datasets show that our SwinGNN achieves state-of-the-art performances. Our code is released at this https URL.
- [955] arXiv:2307.01927 (replaced) [pdf, html, other]
-
Title: Safe Connectivity Maintenance of Underactuated Multi-Agent Networks in Dynamic Oceanic EnvironmentsComments: 8 pages, Published at European Control Conference 2024 (ECC 2024) Nicolas Hoischen and Marius Wiggert contributed equally to this workSubjects: Systems and Control (eess.SY)
Autonomous multi-agent systems are increasingly being deployed in environments where winds and ocean currents have a significant influence. Recent work has developed control policies for single agents that leverage flows to achieve their objectives in dynamic environments. However, in multi-agent systems, these flows can cause agents to collide or drift apart and lose direct inter-agent communications, especially when agents have low propulsion capabilities. To address these challenges, we propose a hierarchical multi-agent control approach that allows arbitrary single-agent performance policies that are unaware of other agents to be used in multi-agent systems while ensuring safe operation. We first develop a safety controller using potential functions, solely dedicated to avoiding collisions and maintaining inter-agent communication. Next, we design a low-interference safe interaction (LISIC) policy that trades off the performance policy and the safety control to ensure safe and performant operation. Specifically, when the agents are at an appropriate distance, LISIC prioritizes the performance policy while smoothly increasing the safety controller when necessary. We prove that under mild assumptions on the flows experienced by the agents, our approach can guarantee safety. Additionally, we demonstrate the effectiveness of our method in realistic settings through an extensive empirical analysis with simulations of fleets of underactuated autonomous surface vehicles operating in dynamic ocean currents where these assumptions do not always hold.
- [956] arXiv:2307.04323 (replaced) [pdf, html, other]
-
Title: Optimal $(2,\delta)$ Locally Repairable Codes via Punctured Simplex CodesSubjects: Information Theory (cs.IT)
Locally repairable codes (LRCs) have attracted a lot of attention due to their applications in distributed storage systems. In this paper, we provide new constructions of optimal $(2, \delta)$-LRCs over $\mathbb{F}_q$ with flexible parameters. Firstly, employing techniques from finite geometry, we introduce a simple yet useful condition to ensure that a punctured simplex code becomes a $(2, \delta)$-LRC. It is worth noting that this condition only imposes a requirement on the size of the puncturing set. Secondly, utilizing character sums over finite fields and Krawtchouk polynomials, we determine the parameters of more punctured simplex codes with puncturing sets of new structures. Several infinite families of LRCs with new parameters are derived. All of our new LRCs are optimal with respect to the generalized Cadambe-Mazumdar bound and some of them are also Griesmer codes or distance-optimal codes.
- [957] arXiv:2307.06090 (replaced) [pdf, html, other]
-
Title: Can Large Language Models Aid in Annotating Speech Emotional Data? Uncovering New FrontiersComments: Accepted in IEEE Computational Intelligence MagazineSubjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
Despite recent advancements in speech emotion recognition (SER) models, state-of-the-art deep learning (DL) approaches face the challenge of the limited availability of annotated data. Large language models (LLMs) have revolutionised our understanding of natural language, introducing emergent properties that broaden comprehension in language, speech, and vision. This paper examines the potential of LLMs to annotate abundant speech data, aiming to enhance the state-of-the-art in SER. We evaluate this capability across various settings using publicly available speech emotion classification datasets. Leveraging ChatGPT, we experimentally demonstrate the promising role of LLMs in speech emotion data annotation. Our evaluation encompasses single-shot and few-shots scenarios, revealing performance variability in SER. Notably, we achieve improved results through data augmentation, incorporating ChatGPT-annotated samples into existing datasets. Our work uncovers new frontiers in speech emotion classification, highlighting the increasing significance of LLMs in this field moving forward.
- [958] arXiv:2307.06930 (replaced) [pdf, html, other]
-
Title: mBLIP: Efficient Bootstrapping of Multilingual Vision-LLMsComments: ALVR Workshop 2024Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Modular vision-language models (Vision-LLMs) align pretrained image encoders with (frozen) large language models (LLMs) and post-hoc condition LLMs to `understand' the image input. With the abundance of readily available high-quality English image-text data as well as strong monolingual English LLMs, the research focus has been on English-only Vision-LLMs. Multilingual vision-language models are still predominantly obtained via expensive end-to-end pretraining, resulting in comparatively smaller models, trained on limited multilingual image data supplemented with text-only multilingual corpora. We present mBLIP, the first Vision-LLM leveraging multilingual LLMs, which we obtain in a computationally efficient manner on consumer-level hardware. To this end, we \textit{re-align} an image encoder previously tuned to an English LLM to a new, multilingual LLM using only a few million multilingual training examples derived from a mix of vision-and-language tasks, which we obtain by machine-translating high-quality English data to 95 languages. On the IGLUE benchmark and XM3600, mBLIP yields results competitive with state-of-the-art models and it greatly outperforms strong English-only Vision-LLMs like Llava 1.5. We release our model, code, and train data at \url{this https URL}.
- [959] arXiv:2307.09758 (replaced) [pdf, html, other]
-
Title: Longitudinal Data and a Semantic Similarity Reward for Chest X-Ray Report GenerationSubjects: Computer Vision and Pattern Recognition (cs.CV)
Radiologists face high burnout rates, partially due to the increasing volume of Chest X-rays (CXRs) requiring interpretation and reporting. Automated CXR report generation holds promise for reducing this burden and improving patient care. While current models show potential, their diagnostic accuracy is limited. Our proposed CXR report generator integrates elements of the radiologist workflow and introduces a novel reward for reinforcement learning. Our approach leverages longitudinal data from a patient's prior CXR study and effectively handles cases where no prior study exist, thus mirroring the radiologist's workflow. In contrast, existing models typically lack this flexibility, often requiring prior studies for the model to function optimally. Our approach also incorporates all CXRs from a patient's study and distinguishes between report sections through section embeddings. Our reward for reinforcement learning leverages CXR-BERT, which forces our model to learn the clinical semantics of radiology reporting. We conduct experiments on publicly available datasets -- MIMIC-CXR and Open-i IU X-ray -- with metrics shown to more closely correlate with radiologists' assessment of reporting. Results from our study demonstrate that the proposed model generates reports that are more aligned with radiologists' reports than state-of-the-art models, such as those utilising large language models, reinforcement learning, and multi-task learning. The proposed model improves the diagnostic accuracy of CXR report generation, which could one day reduce radiologists' workload and enhance patient care. Our Hugging Face checkpoint (this https URL) and code (this https URL) are publicly available.
- [960] arXiv:2307.12044 (replaced) [pdf, html, other]
-
Title: Kinetic description of swarming dynamics with topological interaction and transient leadersSubjects: Numerical Analysis (math.NA); Populations and Evolution (q-bio.PE)
In this paper, we present a model describing the collective motion of birds. The model introduces spontaneous changes in direction which are initialized by few agents, here referred as leaders, whose influence act on their nearest neighbors, in the following referred as followers. Starting at the microscopic level, we develop a kinetic model that characterizes the behaviour of large flocks with transient leadership. One significant challenge lies in managing topological interactions, as identifying nearest neighbors in extensive systems can be computationally expensive. To address this, we propose a novel stochastic particle method to simulate the mesoscopic dynamics and reduce the computational cost of identifying closer agents from quadratic to logarithmic complexity using a $k$-nearest neighbours search algorithm with a binary tree. Lastly, we conduct various numerical experiments for different scenarios to validate the algorithm's effectiveness and investigate collective dynamics in both two and three dimensions.
- [961] arXiv:2307.16325 (replaced) [pdf, html, other]
-
Title: "False negative -- that one is going to kill you": Understanding Industry Perspectives of Static Analysis based Security TestingComments: Published at the IEEE Symposium on Security and Privacy 2024Subjects: Cryptography and Security (cs.CR); Software Engineering (cs.SE)
The demand for automated security analysis techniques, such as static analysis based security testing (SAST) tools continues to increase. To develop SASTs that are effectively leveraged by developers for finding vulnerabilities, researchers and tool designers must understand how developers perceive, select, and use SASTs, what they expect from the tools, whether they know of the limitations of the tools, and how they address those limitations. This paper describes a qualitative study that explores the assumptions, expectations, beliefs, and challenges experienced by developers who use SASTs. We perform in-depth, semi-structured interviews with 20 practitioners who possess a diverse range of software development expertise, as well as a variety of unique security, product, and organizational backgrounds. We identify $17$ key findings that shed light on developer perceptions and desires related to SASTs, and also expose gaps in the status quo - challenging long-held beliefs in SAST design priorities. Finally, we provide concrete future directions for researchers and practitioners rooted in an analysis of our findings.
- [962] arXiv:2308.03276 (replaced) [pdf, html, other]
-
Title: Spatialyze: A Geospatial Video Analytics System with Spatial-Aware OptimizationsComments: Project Page: this https URLJournal-ref: Proc. VLDB Endow. 17 (2024) 2136-2148Subjects: Databases (cs.DB); Computer Vision and Pattern Recognition (cs.CV)
Videos that are shot using commodity hardware such as phones and surveillance cameras record various metadata such as time and location. We encounter such geospatial videos on a daily basis and such videos have been growing in volume significantly. Yet, we do not have data management systems that allow users to interact with such data effectively.
In this paper, we describe Spatialyze, a new framework for end-to-end querying of geospatial videos. Spatialyze comes with a domain-specific language where users can construct geospatial video analytic workflows using a 3-step, declarative, build-filter-observe paradigm. Internally, Spatialyze leverages the declarative nature of such workflows, the temporal-spatial metadata stored with videos, and physical behavior of real-world objects to optimize the execution of workflows. Our results using real-world videos and workflows show that Spatialyze can reduce execution time by up to 5.3x, while maintaining up to 97.1% accuracy compared to unoptimized execution. - [963] arXiv:2308.03372 (replaced) [pdf, html, other]
-
Title: XAI in Automated Fact-Checking? The Benefits Are Modest and There's No One-Explanation-Fits-AllJournal-ref: OzCHI 2023Subjects: Human-Computer Interaction (cs.HC)
The massive volume of online information along with the issue of misinformation has spurred active research in the automation of fact-checking. Like fact-checking by human experts, it is not enough for an automated fact-checker to just be accurate, but also be able to inform and convince the user of the validity of its predictions. This becomes viable with explainable artificial intelligence (XAI). In this work, we conduct a study of XAI fact-checkers involving 180 participants to determine how users' actions towards news and their attitudes towards explanations are affected by the XAI. Our results suggest that XAI has limited effects on users' agreement with the veracity prediction of the automated fact-checker and on their intent to share news. However, XAI nudges users towards forming uniform judgments of news veracity, thereby signaling their reliance on the explanations. We also found polarizing preferences towards XAI and raise several design considerations on them.
- [964] arXiv:2308.04792 (replaced) [pdf, html, other]
-
Title: NNPP: A Learning-Based Heuristic Model for Accelerating Optimal Path Planning on Uneven TerrainSubjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Intelligent autonomous path planning is essential for enhancing the exploration efficiency of mobile robots operating in uneven terrains like planetary surfaces and off-road this http URL this paper, we propose the NNPP model for computing the heuristic region, enabling foundation algorithms like Astar to find the optimal path solely within this reduced search space, effectively decreasing the search time. The NNPP model learns semantic information about start and goal locations, as well as map representations, from numerous pre-annotated optimal path demonstrations, and produces a probabilistic distribution over each pixel representing the likelihood of it belonging to an optimal path on the map. More specifically, the paper computes the traversal cost for each grid cell from the slope, roughness and elevation difference obtained from the digital elevation model. Subsequently, the start and goal locations are encoded using a Gaussian distribution and different location encoding parameters are analyzed for their effect on model performance. After training, the NNPP model is able to \textcolor{revision}{accelerate} path planning on novel maps.
- [965] arXiv:2308.05385 (replaced) [pdf, html, other]
-
Title: Adaptive Taxonomy Learning and Historical Patterns Modelling for Patent ClassificationComments: Accepted by the TOIS journalSubjects: Artificial Intelligence (cs.AI)
Patent classification aims to assign multiple International Patent Classification (IPC) codes to a given patent. Recent methods for automatically classifying patents mainly focus on analyzing the text descriptions of patents. However, apart from the texts, each patent is also associated with some assignees, and the knowledge of their applied patents is often valuable for classification. Furthermore, the hierarchical taxonomy formulated by the IPC system provides important contextual information and enables models to leverage the correlations between IPC codes for more accurate classification. However, existing methods fail to incorporate the above aspects. In this paper, we propose an integrated framework that comprehensively considers the information on patents for patent classification. To be specific, we first present an IPC codes correlations learning module to derive their semantic representations via adaptively passing and aggregating messages within the same level and across different levels along the hierarchical taxonomy. Moreover, we design a historical application patterns learning component to incorporate the corresponding assignee's previous patents by a dual channel aggregation mechanism. Finally, we combine the contextual information of patent texts that contains the semantics of IPC codes, and assignees' sequential preferences to make predictions. Experiments on real-world datasets demonstrate the superiority of our approach over the existing methods. Besides, we present the model's ability to capture the temporal patterns of assignees and the semantic dependencies among IPC codes.
- [966] arXiv:2308.07706 (replaced) [pdf, html, other]
-
Title: Exploring Transfer Learning in Medical Image Segmentation using Vision-Language ModelsComments: Medical Imaging with Deep Learning (MIDL) 2024 (Oral)Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Medical image segmentation allows quantifying target structure size and shape, aiding in disease diagnosis, prognosis, surgery planning, and comprehension.Building upon recent advancements in foundation Vision-Language Models (VLMs) from natural image-text pairs, several studies have proposed adapting them to Vision-Language Segmentation Models (VLSMs) that allow using language text as an additional input to segmentation models. Introducing auxiliary information via text with human-in-the-loop prompting during inference opens up unique opportunities, such as open vocabulary segmentation and potentially more robust segmentation models against out-of-distribution data. Although transfer learning from natural to medical images has been explored for image-only segmentation models, the joint representation of vision-language in segmentation problems remains underexplored. This study introduces the first systematic study on transferring VLSMs to 2D medical images, using carefully curated $11$ datasets encompassing diverse modalities and insightful language prompts and experiments. Our findings demonstrate that although VLSMs show competitive performance compared to image-only models for segmentation after finetuning in limited medical image datasets, not all VLSMs utilize the additional information from language prompts, with image features playing a dominant role. While VLSMs exhibit enhanced performance in handling pooled datasets with diverse modalities and show potential robustness to domain shifts compared to conventional segmentation models, our results suggest that novel approaches are required to enable VLSMs to leverage the various auxiliary information available through language prompts. The code and datasets are available at this https URL.
- [967] arXiv:2308.08378 (replaced) [pdf, html, other]
-
Title: Advancing continual lifelong learning in neural information retrieval: definition, dataset, framework, and empirical evaluationComments: Submitted to Information SciencesSubjects: Information Retrieval (cs.IR); Computation and Language (cs.CL)
Continual learning refers to the capability of a machine learning model to learn and adapt to new information, without compromising its performance on previously learned tasks. Although several studies have investigated continual learning methods for information retrieval tasks, a well-defined task formulation is still lacking, and it is unclear how typical learning strategies perform in this context. To address this challenge, a systematic task formulation of continual neural information retrieval is presented, along with a multiple-topic dataset that simulates continuous information retrieval. A comprehensive continual neural information retrieval framework consisting of typical retrieval models and continual learning strategies is then proposed. Empirical evaluations illustrate that the proposed framework can successfully prevent catastrophic forgetting in neural information retrieval and enhance performance on previously learned tasks. The results indicate that embedding-based retrieval models experience a decline in their continual learning performance as the topic shift distance and dataset volume of new tasks increase. In contrast, pretraining-based models do not show any such correlation. Adopting suitable learning strategies can mitigate the effects of topic shift and data augmentation.
- [968] arXiv:2308.10062 (replaced) [pdf, html, other]
-
Title: Revitalising the Single Batch Environment: A 'Quest' to Achieve Fairness and EfficiencySubjects: Operating Systems (cs.OS)
In the realm of computer systems, efficient utilisation of the CPU (Central Processing Unit) has always been a paramount concern. Researchers and engineers have long sought ways to optimise process execution on the CPU, leading to the emergence of CPU scheduling as a field of study. This research proposes a novel algorithm for batch processing that operates on a preemptive model, dynamically assigning priorities based on a robust ratio, employing a dynamic time slice, and utilising periodic sorting technique to achieve fairness. By engineering this responsive and fair model, the proposed algorithm strikes a delicate balance between efficiency and fairness, providing an optimised solution for batch scheduling while ensuring system responsiveness.
- [969] arXiv:2308.10692 (replaced) [pdf, html, other]
-
Title: Exploring Fine-Grained Representation and Recomposition for Cloth-Changing Person Re-IdentificationComments: Accepted by IEEE TIFS 2024Subjects: Computer Vision and Pattern Recognition (cs.CV)
Cloth-changing person Re-IDentification (Re-ID) is a particularly challenging task, suffering from two limitations of inferior discriminative features and limited training samples. Existing methods mainly leverage auxiliary information to facilitate identity-relevant feature learning, including soft-biometrics features of shapes or gaits, and additional labels of clothing. However, this information may be unavailable in real-world applications. In this paper, we propose a novel FIne-grained Representation and Recomposition (FIRe$^{2}$) framework to tackle both limitations without any auxiliary annotation or data. Specifically, we first design a Fine-grained Feature Mining (FFM) module to separately cluster images of each person. Images with similar so-called fine-grained attributes (e.g., clothes and viewpoints) are encouraged to cluster together. An attribute-aware classification loss is introduced to perform fine-grained learning based on cluster labels, which are not shared among different people, promoting the model to learn identity-relevant features. Furthermore, to take full advantage of fine-grained attributes, we present a Fine-grained Attribute Recomposition (FAR) module by recomposing image features with different attributes in the latent space. It significantly enhances robust feature learning. Extensive experiments demonstrate that FIRe$^{2}$ can achieve state-of-the-art performance on five widely-used cloth-changing person Re-ID benchmarks. The code is available at this https URL.
- [970] arXiv:2308.12215 (replaced) [pdf, html, other]
-
Title: The Challenges of Machine Learning for Trust and Safety: A Case Study on Misinformation DetectionSubjects: Machine Learning (cs.LG); Computation and Language (cs.CL); Computers and Society (cs.CY)
We examine the disconnect between scholarship and practice in applying machine learning to trust and safety problems, using misinformation detection as a case study. We survey literature on automated detection of misinformation across a corpus of 248 well-cited papers in the field. We then examine subsets of papers for data and code availability, design missteps, reproducibility, and generalizability. Our paper corpus includes published work in security, natural language processing, and computational social science. Across these disparate disciplines, we identify common errors in dataset and method design. In general, detection tasks are often meaningfully distinct from the challenges that online services actually face. Datasets and model evaluation are often non-representative of real-world contexts, and evaluation frequently is not independent of model training. We demonstrate the limitations of current detection methods in a series of three representative replication studies. Based on the results of these analyses and our literature survey, we conclude that the current state-of-the-art in fully-automated misinformation detection has limited efficacy in detecting human-generated misinformation. We offer recommendations for evaluating applications of machine learning to trust and safety problems and recommend future directions for research.
- [971] arXiv:2308.12252 (replaced) [pdf, html, other]
-
Title: How Safe Am I Given What I See? Calibrated Prediction of Safety Chances for Image-Controlled AutonomyComments: This is supplementary material to the paper: How Safe Am I Given What I See? Calibrated Prediction of Safety Chances for Image-Controlled Autonomy in 6th Annual Learning for Dynamics & Control Conference (L4DC 2024)Subjects: Machine Learning (cs.LG)
End-to-end learning has emerged as a major paradigm for developing autonomous systems. Unfortunately, with its performance and convenience comes an even greater challenge of safety assurance. A key factor of this challenge is the absence of the notion of a low-dimensional and interpretable dynamical state, around which traditional assurance methods revolve. Focusing on the online safety prediction problem, this paper proposes a configurable family of learning pipelines based on generative world models, which do not require low-dimensional states. To implement these pipelines, we overcome the challenges of learning safety-informed latent representations and missing safety labels under prediction-induced distribution shift. These pipelines come with statistical calibration guarantees on their safety chance predictions based on conformal prediction. We perform an extensive evaluation of the proposed learning pipelines on two case studies of image-controlled systems: a racing car and a cartpole.
- [972] arXiv:2308.12568 (replaced) [pdf, html, other]
-
Title: A Small and Fast BERT for Chinese Medical Punctuation RestorationComments: 5 pages, 2 figures, Accepted by INTERSPEECH 2024Subjects: Computation and Language (cs.CL)
In clinical dictation, utterances after automatic speech recognition (ASR) without explicit punctuation marks may lead to the misunderstanding of dictated reports. To give a precise and understandable clinical report with ASR, automatic punctuation restoration is required. Considering a practical scenario, we propose a fast and light pre-trained model for Chinese medical punctuation restoration based on 'pretraining and fine-tuning' paradigm. In this work, we distill pre-trained models by incorporating supervised contrastive learning and a novel auxiliary pre-training task (Punctuation Mark Prediction) to make it well-suited for punctuation restoration. Our experiments on various distilled models reveal that our model can achieve 95% performance while 10% model size relative to state-of-the-art Chinese RoBERTa.
- [973] arXiv:2308.14508 (replaced) [pdf, other]
-
Title: LongBench: A Bilingual, Multitask Benchmark for Long Context UnderstandingYushi Bai, Xin Lv, Jiajie Zhang, Hongchang Lyu, Jiankai Tang, Zhidian Huang, Zhengxiao Du, Xiao Liu, Aohan Zeng, Lei Hou, Yuxiao Dong, Jie Tang, Juanzi LiComments: ACL 2024Subjects: Computation and Language (cs.CL)
Although large language models (LLMs) demonstrate impressive performance for many language tasks, most of them can only handle texts a few thousand tokens long, limiting their applications on longer sequence inputs, such as books, reports, and codebases. Recent works have proposed methods to improve LLMs' long context capabilities by extending context windows and more sophisticated memory mechanisms. However, comprehensive benchmarks tailored for evaluating long context understanding are lacking. In this paper, we introduce LongBench, the first bilingual, multi-task benchmark for long context understanding, enabling a more rigorous evaluation of long context understanding. LongBench comprises 21 datasets across 6 task categories in both English and Chinese, with an average length of 6,711 words (English) and 13,386 characters (Chinese). These tasks cover key long-text application areas including single-doc QA, multi-doc QA, summarization, few-shot learning, synthetic tasks, and code completion. All datasets in LongBench are standardized into a unified format, allowing for effortless automatic evaluation of LLMs. Upon comprehensive evaluation of 8 LLMs on LongBench, we find that: (1) Commercial model (GPT-3.5-Turbo-16k) outperforms other open-sourced models, but still struggles on longer contexts. (2) Scaled position embedding and fine-tuning on longer sequences lead to substantial improvement on long context understanding. (3) Context compression technique such as retrieval brings improvement for model with weak ability on long contexts, but the performance still lags behind models that have strong long context understanding capability. The code and datasets are available at this https URL.
- [974] arXiv:2309.00931 (replaced) [pdf, other]
-
Title: Parametric Finite Element Discretization of the Surface Stokes EquationsComments: 40 pages, 14 figuresSubjects: Numerical Analysis (math.NA)
We study a higher-order surface finite element (SFEM) penalty-based discretization of the tangential surface Stokes problem. Several discrete formulations are investigated which are equivalent in the continuous setting. The impact of the choice of discretization of the diffusion term and of the divergence term on numerical accuracy and convergence, as well as on implementation advantages, is discussed. We analyze the inf-sup stability of the discrete scheme in a generic approach by lifting stable finite element pairs known from the literature. A discretization error analysis in tangential norms then shows optimal order convergence of an isogeometric setting that requires only geometric knowledge of the discrete surface.
- [975] arXiv:2309.04038 (replaced) [pdf, html, other]
-
Title: S-Adapter: Generalizing Vision Transformer for Face Anti-Spoofing with Statistical TokensComments: Accepted by IEEE Transactions on Information Forensics Security (June 2024)Subjects: Computer Vision and Pattern Recognition (cs.CV)
Face Anti-Spoofing (FAS) aims to detect malicious attempts to invade a face recognition system by presenting spoofed faces. State-of-the-art FAS techniques predominantly rely on deep learning models but their cross-domain generalization capabilities are often hindered by the domain shift problem, which arises due to different distributions between training and testing data. In this study, we develop a generalized FAS method under the Efficient Parameter Transfer Learning (EPTL) paradigm, where we adapt the pre-trained Vision Transformer models for the FAS task. During training, the adapter modules are inserted into the pre-trained ViT model, and the adapters are updated while other pre-trained parameters remain fixed. We find the limitations of previous vanilla adapters in that they are based on linear layers, which lack a spoofing-aware inductive bias and thus restrict the cross-domain generalization. To address this limitation and achieve cross-domain generalized FAS, we propose a novel Statistical Adapter (S-Adapter) that gathers local discriminative and statistical information from localized token histograms. To further improve the generalization of the statistical tokens, we propose a novel Token Style Regularization (TSR), which aims to reduce domain style variance by regularizing Gram matrices extracted from tokens across different domains. Our experimental results demonstrate that our proposed S-Adapter and TSR provide significant benefits in both zero-shot and few-shot cross-domain testing, outperforming state-of-the-art methods on several benchmark tests. We will release the source code upon acceptance.
- [976] arXiv:2309.04505 (replaced) [pdf, html, other]
-
Title: COVID-19 Detection System: A Comparative Analysis of System Performance Based on Acoustic Features of Cough Audio SignalsComments: 8 pages, 3 figuresJournal-ref: 2023 IEEE 22nd International Conference on Trust, Security and Privacy in Computing and Communications (TrustCom), Exeter, United Kingdom, 2023, pp. 2706-2713Subjects: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
A wide range of respiratory diseases, such as cold and flu, asthma, and COVID-19, affect people's daily lives worldwide. In medical practice, respiratory sounds are widely used in medical services to diagnose various respiratory illnesses and lung disorders. The traditional diagnosis of such sounds requires specialized knowledge, which can be costly and reliant on human expertise. Despite this, recent advancements, such as cough audio recordings, have emerged as a means to automate the detection of respiratory conditions. Therefore, this research aims to explore various acoustic features that enhance the performance of machine learning (ML) models in detecting COVID-19 from cough signals. It investigates the efficacy of three feature extraction techniques, including Mel Frequency Cepstral Coefficients (MFCC), Chroma, and Spectral Contrast features, when applied to two machine learning algorithms, Support Vector Machine (SVM) and Multilayer Perceptron (MLP), and therefore proposes an efficient CovCepNet detection system. The proposed system provides a practical solution and demonstrates state-of-the-art classification performance, with an AUC of 0.843 on the COUGHVID dataset and 0.953 on the Virufy dataset for COVID-19 detection from cough audio signals.
- [977] arXiv:2309.06219 (replaced) [pdf, html, other]
-
Title: Human Action Co-occurrence in Lifestyle Vlogs using Graph Link PredictionSubjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Computers and Society (cs.CY); Information Retrieval (cs.IR)
We introduce the task of automatic human action co-occurrence identification, i.e., determine whether two human actions can co-occur in the same interval of time. We create and make publicly available the ACE (Action Co-occurrencE) dataset, consisting of a large graph of ~12k co-occurring pairs of visual actions and their corresponding video clips. We describe graph link prediction models that leverage visual and textual information to automatically infer if two actions are co-occurring. We show that graphs are particularly well suited to capture relations between human actions, and the learned graph representations are effective for our task and capture novel and relevant information across different data domains. The ACE dataset and the code introduced in this paper are publicly available at this https URL.
- [978] arXiv:2309.06718 (replaced) [pdf, html, other]
-
Title: Immersion and Invariance-based Disturbance Observer and Its Application to Safe ControlComments: Accepted to IEEE Transactions on Automatic ControlJournal-ref: 10.1109/TAC.2024.3416323Subjects: Systems and Control (eess.SY); Optimization and Control (math.OC)
When the disturbance input matrix is nonlinear, existing disturbance observer design methods rely on the solvability of a partial differential equation or the existence of an output function with a uniformly well-defined disturbance relative degree, which can pose significant limitations. This note introduces a systematic approach for designing an Immersion and Invariance-based Disturbance Observer (IIDOB) that circumvents these strong assumptions. The proposed IIDOB ensures the disturbance estimation error is globally uniformly ultimately bounded by approximately solving a partial differential equation while compensating for the approximation error. Furthermore, by integrating IIDOB into the framework of control barrier functions, a filter-based safe control design method for control-affine systems with disturbances is established where the filter is used to generate an alternative disturbance estimation signal with a known derivative. Sufficient conditions are established to guarantee the safety of the disturbed systems. Simulation results demonstrate the effectiveness of the proposed method.
- [979] arXiv:2309.07657 (replaced) [pdf, other]
-
Title: Sync+Sync: A Covert Channel Built on fsync with StorageComments: A full version for the paper with the same title accepted by the 33rd USENIX Security Symposium (USENIX Security 2024)Subjects: Cryptography and Security (cs.CR); Operating Systems (cs.OS)
Scientists have built a variety of covert channels for secretive information transmission with CPU cache and main memory. In this paper, we turn to a lower level in the memory hierarchy, i.e., persistent storage. Most programs store intermediate or eventual results in the form of files and some of them call fsync to synchronously persist a file with storage device for orderly persistence. Our quantitative study shows that one program would undergo significantly longer response time for fsync call if the other program is concurrently calling fsync, although they do not share any data. We further find that, concurrent fsync calls contend at multiple levels of storage stack due to sharing software structures (e.g., Ext4's journal) and hardware resources (e.g., disk's I/O dispatch queue).
We accordingly build a covert channel named Sync+Sync. Sync+Sync delivers a transmission bandwidth of 20,000 bits per second at an error rate of about 0.40% with an ordinary solid-state drive. Sync+Sync can be conducted in cross-disk partition, cross-file system, cross-container, cross-virtual machine, and even cross-disk drive fashions, without sharing data between programs. Next, we launch side-channel attacks with Sync+Sync and manage to precisely detect operations of a victim database (e.g., insert/update and B-Tree node split). We also leverage Sync+Sync to distinguish applications and websites with high accuracy by detecting and analyzing their fsync frequencies and flushed data volumes. These attacks are useful to support further fine-grained information leakage. - [980] arXiv:2309.08069 (replaced) [pdf, html, other]
-
Title: Connecting the Dots in News Analysis: Bridging the Cross-Disciplinary Disparities in Media Bias and FramingComments: Accepted to the sixth Workshop on Natural Language Processing and Computational Social (NLP+CSS)Subjects: Computation and Language (cs.CL)
The manifestation and effect of bias in news reporting have been central topics in the social sciences for decades, and have received increasing attention in the NLP community recently. While NLP can help to scale up analyses or contribute automatic procedures to investigate the impact of biased news in society, we argue that methodologies that are currently dominant fall short of addressing the complex questions and effects addressed in theoretical media studies. In this survey paper, we review social science approaches and draw a comparison with typical task formulations, methods, and evaluation metrics used in the analysis of media bias in NLP. We discuss open questions and suggest possible directions to close identified gaps between theory and predictive models, and their evaluation. These include model transparency, considering document-external information, and cross-document reasoning rather than single-label assignment.
- [981] arXiv:2309.08902 (replaced) [pdf, html, other]
-
Title: Investigating Subtler Biases in LLMs: Ageism, Beauty, Institutional, and Nationality Bias in Generative ModelsComments: Camera-ready version for Findings of ACL 2024Subjects: Computation and Language (cs.CL)
LLMs are increasingly powerful and widely used to assist users in a variety of tasks. This use risks the introduction of LLM biases to consequential decisions such as job hiring, human performance evaluation, and criminal sentencing. Bias in NLP systems along the lines of gender and ethnicity has been widely studied, especially for specific stereotypes (e.g., Asians are good at math). In this paper, we investigate bias along less-studied but still consequential, dimensions, such as age and beauty, measuring subtler correlated decisions that LLMs make between social groups and unrelated positive and negative attributes. We ask whether LLMs hold wide-reaching biases of positive or negative sentiment for specific social groups similar to the "what is beautiful is good" bias found in people in experimental psychology. We introduce a template-generated dataset of sentence completion tasks that asks the model to select the most appropriate attribute to complete an evaluative statement about a person described as a member of a specific social group. We also reverse the completion task to select the social group based on an attribute. We report the correlations that we find for 4 cutting-edge LLMs. This dataset can be used as a benchmark to evaluate progress in more generalized biases and the templating technique can be used to expand the benchmark with minimal additional human annotation.
- [982] arXiv:2309.09196 (replaced) [pdf, html, other]
-
Title: Efficient Pyramid Channel Attention Network for Pathological Myopia RecognitionComments: 37 pagesSubjects: Computer Vision and Pattern Recognition (cs.CV)
Pathological myopia (PM) is the leading ocular disease for impaired vision worldwide. Clinically, the characteristic of pathology distribution in PM is global-local on the fundus image, which plays a significant role in assisting clinicians in diagnosing PM. However, most existing deep neural networks focused on designing complex architectures but rarely explored the pathology distribution prior of PM. To tackle this issue, we propose an efficient pyramid channel attention (EPCA) module, which fully leverages the potential of the clinical pathology prior of PM with pyramid pooling and multi-scale context fusion. Then, we construct EPCA-Net for automatic PM recognition based on fundus images by stacking a sequence of EPCA modules. Moreover, motivated by the recent pretraining-and-finetuning paradigm, we attempt to adapt pre-trained natural image models for PM recognition by freezing them and treating the EPCA and other attention modules as adapters. In addition, we construct a PM recognition benchmark termed PM-fundus by collecting fundus images of PM from publicly available datasets. The comprehensive experiments demonstrate the superiority of our EPCA-Net over state-of-the-art methods in the PM recognition task. The results also show that our method based on the pretraining-and-finetuning paradigm achieves competitive performance through comparisons to part of previous methods based on traditional fine-tuning paradigm with fewer tunable parameters, which has the potential to leverage more natural image foundation models to address the PM recognition task in limited medical data regime.
- [983] arXiv:2309.11143 (replaced) [pdf, html, other]
-
Title: CoT-BERT: Enhancing Unsupervised Sentence Representation through Chain-of-ThoughtComments: Accepted by ICANN 2024Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Unsupervised sentence representation learning aims to transform input sentences into fixed-length vectors enriched with intricate semantic information while obviating the reliance on labeled data. Recent strides within this domain have been significantly propelled by breakthroughs in contrastive learning and prompt engineering. Despite these advancements, the field has reached a plateau, leading some researchers to incorporate external components to enhance the quality of sentence embeddings. Such integration, though beneficial, complicates solutions and inflates demands for computational resources. In response to these challenges, this paper presents CoT-BERT, an innovative method that harnesses the progressive thinking of Chain-of-Thought reasoning to tap into the latent potential of pre-trained models like BERT. Additionally, we develop an advanced contrastive learning loss function and propose a novel template denoising strategy. Rigorous experimentation demonstrates that CoT-BERT surpasses a range of well-established baselines by relying exclusively on the intrinsic strengths of pre-trained models.
- [984] arXiv:2309.12875 (replaced) [pdf, html, other]
-
Title: A second-order in time, BGN-based parametric finite element method for geometric flows of curvesComments: 37 pages, 8 figuresSubjects: Numerical Analysis (math.NA)
Over the last two decades, the field of geometric curve evolutions has attracted significant attention from scientific computing. One of the most popular numerical methods for solving geometric flows is the so-called BGN scheme, which was proposed by Barrett, Garcke, and Nürnberg (J. Comput. Phys., 222 (2007), pp.~441--467), due to its favorable properties (e.g., its computational efficiency and the good mesh property). However, the BGN scheme is limited to first-order accuracy in time, and how to develop a higher-order numerical scheme is challenging. In this paper, we propose a fully discrete, temporal second-order parametric finite element method, which integrates with two different mesh regularization techniques, for solving geometric flows of curves. The scheme is constructed based on the BGN formulation and a semi-implicit Crank-Nicolson leap-frog time stepping discretization as well as a linear finite element approximation in space. More importantly, we point out that the shape metrics, such as manifold distance and Hausdorff distance, instead of function norms, should be employed to measure numerical errors. Extensive numerical experiments demonstrate that the proposed BGN-based scheme is second-order accurate in time in terms of shape metrics. Moreover, by employing the classical BGN scheme as mesh regularization techniques, our proposed second-order schemes exhibit good properties with respect to the mesh distribution. In addition, an unconditional interlaced energy stability property is obtained for one of the mesh regularization techniques.
- [985] arXiv:2309.14169 (replaced) [pdf, html, other]
-
Title: Extrapolated regularization of nearly singular integrals on surfacesSubjects: Numerical Analysis (math.NA)
We present a method for computing nearly singular integrals that occur when single or double layer surface integrals, for harmonic potentials or Stokes flow, are evaluated at points nearby. Such values could be needed in solving an integral equation when one surface is close to another or to obtain values at grid points. We replace the singular kernel with a regularized version having a length parameter $\delta$ in order to control discretization error. Analysis near the singularity leads to an expression for the error due to regularization which has terms with unknown coefficients multiplying known quantities. By computing the integral with three choices of $\delta$ we can solve for an extrapolated value that has regularization error reduced to $O(\delta^5)$, uniformly for target points on or near the surface. In examples with $\delta/h$ constant and moderate resolution we observe total error about $O(h^5)$ close to the surface. For convergence as $h \to 0$ we can choose $\delta$ proportional to $h^q$ with $q < 1$ to ensure the discretization error is dominated by the regularization error. With $q = 4/5$ we find errors about $O(h^4)$. For harmonic potentials we extend the approach to a version with $O(\delta^7)$ regularization; it typically has smaller errors but the order of accuracy is less predictable.
- [986] arXiv:2309.14645 (replaced) [pdf, html, other]
-
Title: A nonparametric learning framework for nonlinear robust output regulationComments: 17 pages; Nonlinear control; iISS stability; output regulation; parameter estimation; Non-adaptive controlSubjects: Systems and Control (eess.SY); Optimization and Control (math.OC); Adaptation and Self-Organizing Systems (nlin.AO)
A nonparametric learning solution framework is proposed for the global nonlinear robust output regulation problem. We first extend the assumption that the steady-state generator is linear in the exogenous signal to the more relaxed assumption that it is polynomial in the exogenous signal. Additionally, a nonparametric learning framework is proposed to eliminate the construction of an explicit regressor, as required in the adaptive method, which can potentially simplify the implementation and reduce the computational complexity of existing methods. With the help of the proposed framework, the robust nonlinear output regulation problem can be converted into a robust non-adaptive stabilization problem for the augmented system with integral input-to-state stable (iISS) inverse dynamics. Moreover, a dynamic gain approach can adaptively raise the gain to a sufficiently large constant to achieve stabilization without requiring any a priori knowledge of the uncertainties appearing in the dynamics of the exosystem and the system. Furthermore, we apply the nonparametric learning framework to globally reconstruct and estimate multiple sinusoidal signals with unknown frequencies without the need for adaptive parametric techniques. An explicit nonlinear mapping can directly provide the estimated parameters, which will exponentially converge to the unknown frequencies. Finally, a feedforward control design is proposed to solve the linear output regulation problem using the nonparametric learning framework. Two simulation examples are provided to illustrate the effectiveness of the theoretical results.
- [987] arXiv:2309.15857 (replaced) [pdf, html, other]
-
Title: A Survey on Image-text Multimodal ModelsRuifeng Guo, Jingxuan Wei, Linzhuang Sun, Bihui Yu, Guiyong Chang, Dawei Liu, Sibo Zhang, Zhengbing Yao, Mingjun Xu, Liping BuSubjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
With the significant advancements of Large Language Models (LLMs) in the field of Natural Language Processing (NLP), the development of image-text multimodal models has garnered widespread attention. Current surveys on image-text multimodal models mainly focus on representative models or application domains, but lack a review on how general technical models influence the development of domain-specific models, which is crucial for domain researchers. Based on this, this paper first reviews the technological evolution of image-text multimodal models, from early explorations of feature space to visual language encoding structures, and then to the latest large model architectures. Next, from the perspective of technological evolution, we explain how the development of general image-text multimodal technologies promotes the progress of multimodal technologies in the biomedical field, as well as the importance and complexity of specific datasets in the biomedical domain. Then, centered on the tasks of image-text multimodal models, we analyze their common components and challenges. After that, we summarize the architecture, components, and data of general image-text multimodal models, and introduce the applications and improvements of image-text multimodal models in the biomedical field. Finally, we categorize the challenges faced in the development and application of general models into external factors and intrinsic factors, further refining them into 2 external factors and 5 intrinsic factors, and propose targeted solutions, providing guidance for future research directions. For more details and data, please visit our GitHub page: \url{this https URL}.
- [988] arXiv:2309.16792 (replaced) [pdf, html, other]
-
Title: Agent Coordination via Contextual Regression (AgentCONCUR) for Data Center FlexibilitySubjects: Systems and Control (eess.SY); Optimization and Control (math.OC)
A network of spatially distributed data centers can provide operational flexibility to power systems by shifting computing tasks among electrically remote locations. However, harnessing this flexibility in real-time through the standard optimization techniques is challenged by the need for sensitive operational datasets and substantial computational resources. To alleviate the data and computational requirements, this paper introduces a coordination mechanism based on contextual regression. This mechanism, abbreviated as AgentCONCUR, associates cost-optimal task shifts with public and trusted contextual data (e.g., real-time prices) and uses regression on this data as a coordination policy. Notably, regression-based coordination does not learn the optimal coordination actions from a labeled dataset. Instead, it exploits the optimization structure of the coordination problem to ensure feasible and cost-effective actions. A NYISO-based study reveals large coordination gains and the optimal features for the successful regression-based coordination.
- [989] arXiv:2310.00109 (replaced) [pdf, html, other]
-
Title: FedAIoT: A Federated Learning Benchmark for Artificial Intelligence of ThingsSamiul Alam, Tuo Zhang, Tiantian Feng, Hui Shen, Zhichao Cao, Dong Zhao, JeongGil Ko, Kiran Somasundaram, Shrikanth S. Narayanan, Salman Avestimehr, Mi ZhangSubjects: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC); Digital Libraries (cs.DL)
There is a significant relevance of federated learning (FL) in the realm of Artificial Intelligence of Things (AIoT). However, most existing FL works do not use datasets collected from authentic IoT devices and thus do not capture unique modalities and inherent challenges of IoT data. To fill this critical gap, in this work, we introduce FedAIoT, an FL benchmark for AIoT. FedAIoT includes eight datasets collected from a wide range of IoT devices. These datasets cover unique IoT modalities and target representative applications of AIoT. FedAIoT also includes a unified end-to-end FL framework for AIoT that simplifies benchmarking the performance of the datasets. Our benchmark results shed light on the opportunities and challenges of FL for AIoT. We hope FedAIoT could serve as an invaluable resource to foster advancements in the important field of FL for AIoT. The repository of FedAIoT is maintained at this https URL.
- [990] arXiv:2310.00905 (replaced) [pdf, html, other]
-
Title: All Languages Matter: On the Multilingual Safety of Large Language ModelsComments: Accepted by ACL 2024 Findings. The first multilingual safety benchmark for large language modelsSubjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Safety lies at the core of developing and deploying large language models (LLMs). However, previous safety benchmarks only concern the safety in one language, e.g. the majority language in the pretraining data such as English. In this work, we build the first multilingual safety benchmark for LLMs, XSafety, in response to the global deployment of LLMs in practice. XSafety covers 14 kinds of commonly used safety issues across 10 languages that span several language families. We utilize XSafety to empirically study the multilingual safety for 4 widely-used LLMs, including both close-API and open-source models. Experimental results show that all LLMs produce significantly more unsafe responses for non-English queries than English ones, indicating the necessity of developing safety alignment for non-English languages. In addition, we propose several simple and effective prompting methods to improve the multilingual safety of ChatGPT by evoking safety knowledge and improving cross-lingual generalization of safety alignment. Our prompting method can significantly reduce the ratio of unsafe responses from 19.1% to 9.7% for non-English queries. We release our data at this https URL.
- [991] arXiv:2310.01727 (replaced) [pdf, other]
-
Title: Can GPT-4 Replicate Empirical Software Engineering Research?Jenny T. Liang, Carmen Badea, Christian Bird, Robert DeLine, Denae Ford, Nicole Forsgren, Thomas ZimmermannSubjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Empirical software engineering research on production systems has brought forth a better understanding of the software engineering process for practitioners and researchers alike. However, only a small subset of production systems is studied, limiting the impact of this research. While software engineering practitioners could benefit from replicating research on their own data, this poses its own set of challenges, since performing replications requires a deep understanding of research methodologies and subtle nuances in software engineering data. Given that large language models (LLMs), such as GPT-4, show promise in tackling both software engineering- and science-related tasks, these models could help replicate and thus democratize empirical software engineering research.
In this paper, we examine GPT-4's abilities to perform replications of empirical software engineering research on new data. We study their ability to surface assumptions made in empirical software engineering research methodologies, as well as their ability to plan and generate code for analysis pipelines on seven empirical software engineering papers. We perform a user study with 14 participants with software engineering research expertise, who evaluate GPT-4-generated assumptions and analysis plans (i.e., a list of module specifications) from the papers. We find that GPT-4 is able to surface correct assumptions, but struggles to generate ones that apply common knowledge about software engineering data. In a manual analysis of the generated code, we find that the GPT-4-generated code contains correct high-level logic, given a subset of the methodology. However, the code contains many small implementation-level errors, reflecting a lack of software engineering knowledge. Our findings have implications for leveraging LLMs for software engineering research as well as practitioner data scientists in software teams. - [992] arXiv:2310.03304 (replaced) [pdf, html, other]
-
Title: Learning Personalized Alignment for Evaluating Open-ended Text GenerationComments: 19 pagesSubjects: Computation and Language (cs.CL)
With rapid progress made in language qualities such as fluency and consistency via large language models (LLMs), there has been increasing interest in assessing alignment with diverse human preferences. Traditional metrics heavily rely on lexical similarity with human-written references and have been observed to suffer from a poor correlation with human evaluation. Furthermore, they ignore the diverse preferences of humans, a key aspect in evaluating open-ended tasks like story generation. Inspired by these challenges, we introduce an interpretable open-ended evaluation framework PerSE to assess the alignment with a specific human preference. It is tuned to deduce the specific preference from a given personal profile and evaluate the alignment between the generation and the personal preference. PerSE also explains its assessment by a detailed comment or several fine-grained scores. This enhances its interpretability, making it more suitable to tailor a personalized generation. Our 13B LLaMA-2-based PerSE shows a 15.8% increase in Kendall correlation and a 13.7% rise in accuracy on zero-shot reviewers compared to GPT-4. It also outperforms GPT-4 by 46.01% in the Kendall correlation on new domains, indicating its transferability.
- [993] arXiv:2310.05128 (replaced) [pdf, html, other]
-
Title: Instances and Labels: Hierarchy-aware Joint Supervised Contrastive Learning for Hierarchical Multi-Label Text ClassificationComments: 18 pages; 10 figures. Published as a conference paper at EMNLP 2023 Findings (Long Paper). Code and data available at this https URLSubjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Hierarchical multi-label text classification (HMTC) aims at utilizing a label hierarchy in multi-label classification. Recent approaches to HMTC deal with the problem of imposing an over-constrained premise on the output space by using contrastive learning on generated samples in a semi-supervised manner to bring text and label embeddings closer. However, the generation of samples tends to introduce noise as it ignores the correlation between similar samples in the same batch. One solution to this issue is supervised contrastive learning, but it remains an underexplored topic in HMTC due to its complex structured labels. To overcome this challenge, we propose $\textbf{HJCL}$, a $\textbf{H}$ierarchy-aware $\textbf{J}$oint Supervised $\textbf{C}$ontrastive $\textbf{L}$earning method that bridges the gap between supervised contrastive learning and HMTC. Specifically, we employ both instance-wise and label-wise contrastive learning techniques and carefully construct batches to fulfill the contrastive learning objective. Extensive experiments on four multi-path HMTC datasets demonstrate that HJCL achieves promising results and the effectiveness of Contrastive Learning on HMTC.
- [994] arXiv:2310.05227 (replaced) [pdf, html, other]
-
Title: Physics-aware Machine Learning Revolutionizes Scientific Paradigm for Machine Learning and Process-based HydrologyComments: 33 pages, 6 figuresSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Fluid Dynamics (physics.flu-dyn)
Accurate hydrological understanding and water cycle prediction are crucial for addressing scientific and societal challenges associated with the management of water resources, particularly under the dynamic influence of anthropogenic climate change. Existing reviews predominantly concentrate on the development of machine learning (ML) in this field, yet there is a clear distinction between hydrology and ML as separate paradigms. Here, we introduce physics-aware ML as a transformative approach to overcome the perceived barrier and revolutionize both fields. Specifically, we present a comprehensive review of the physics-aware ML methods, building a structured community (PaML) of existing methodologies that integrate prior physical knowledge or physics-based modeling into ML. We systematically analyze these PaML methodologies with respect to four aspects: physical data-guided ML, physics-informed ML, physics-embedded ML, and physics-aware hybrid learning. PaML facilitates ML-aided hypotheses, accelerating insights from big data and fostering scientific discoveries. We first conduct a systematic review of hydrology in PaML, including rainfall-runoff hydrological processes and hydrodynamic processes, and highlight the most promising and challenging directions for different objectives and PaML methods. Finally, a new PaML-based hydrology platform, termed HydroPML, is released as a foundation for hydrological applications. HydroPML enhances the explainability and causality of ML and lays the groundwork for the digital water cycle's realization. The HydroPML platform is publicly available at this https URL.
- [995] arXiv:2310.07059 (replaced) [pdf, other]
-
Title: DKEC: Domain Knowledge Enhanced Multi-Label Classification for Diagnosis PredictionSubjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Multi-label text classification (MLTC) tasks in the medical domain often face the long-tail label distribution problem. Prior works have explored hierarchical label structures to find relevant information for few-shot classes, but mostly neglected to incorporate external knowledge from medical guidelines. This paper presents DKEC, Domain Knowledge Enhanced Classification for diagnosis prediction with two innovations: (1) automated construction of heterogeneous knowledge graphs from external sources to capture semantic relations among diverse medical entities, (2) incorporating the heterogeneous knowledge graphs in few-shot classification using a label-wise attention mechanism. We construct DKEC using three online medical knowledge sources and evaluate it on a real-world Emergency Medical Services (EMS) dataset and a public electronic health record (EHR) dataset. Results show that DKEC outperforms the state-of-the-art label-wise attention networks and transformer models of different sizes, particularly for the few-shot classes. More importantly, it helps the smaller language models achieve comparable performance to large language models.
- [996] arXiv:2310.07236 (replaced) [pdf, html, other]
-
Title: AdaMesh: Personalized Facial Expressions and Head Poses for Adaptive Speech-Driven 3D Facial AnimationComments: Project Page: this https URLSubjects: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Speech-driven 3D facial animation aims at generating facial movements that are synchronized with the driving speech, which has been widely explored recently. Existing works mostly neglect the person-specific talking style in generation, including facial expression and head pose styles. Several works intend to capture the personalities by fine-tuning modules. However, limited training data leads to the lack of vividness. In this work, we propose AdaMesh, a novel adaptive speech-driven facial animation approach, which learns the personalized talking style from a reference video of about 10 seconds and generates vivid facial expressions and head poses. Specifically, we propose mixture-of-low-rank adaptation (MoLoRA) to fine-tune the expression adapter, which efficiently captures the facial expression style. For the personalized pose style, we propose a pose adapter by building a discrete pose prior and retrieving the appropriate style embedding with a semantic-aware pose style matrix without fine-tuning. Extensive experimental results show that our approach outperforms state-of-the-art methods, preserves the talking style in the reference video, and generates vivid facial animation. The supplementary video and code will be available at this https URL.
- [997] arXiv:2310.08753 (replaced) [pdf, html, other]
-
Title: CompA: Addressing the Gap in Compositional Reasoning in Audio-Language ModelsSreyan Ghosh, Ashish Seth, Sonal Kumar, Utkarsh Tyagi, Chandra Kiran Evuru, Ramaneswaran S, S. Sakshi, Oriol Nieto, Ramani Duraiswami, Dinesh ManochaComments: ICLR 2024Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
A fundamental characteristic of audio is its compositional nature. Audio-language models (ALMs) trained using a contrastive approach (e.g., CLAP) that learns a shared representation between audio and language modalities have improved performance in many downstream applications, including zero-shot audio classification, audio retrieval, etc. However, the ability of these models to effectively perform compositional reasoning remains largely unexplored and necessitates additional research. In this paper, we propose CompA, a collection of two expert-annotated benchmarks with a majority of real-world audio samples, to evaluate compositional reasoning in ALMs. Our proposed CompA-order evaluates how well an ALM understands the order or occurrence of acoustic events in audio, and CompA-attribute evaluates attribute-binding of acoustic events. An instance from either benchmark consists of two audio-caption pairs, where both audios have the same acoustic events but with different compositions. An ALM is evaluated on how well it matches the right audio to the right caption. Using this benchmark, we first show that current ALMs perform only marginally better than random chance, thereby struggling with compositional reasoning. Next, we propose CompA-CLAP, where we fine-tune CLAP using a novel learning method to improve its compositional reasoning abilities. To train CompA-CLAP, we first propose improvements to contrastive training with composition-aware hard negatives, allowing for more focused training. Next, we propose a novel modular contrastive loss that helps the model learn fine-grained compositional understanding and overcomes the acute scarcity of openly available compositional audios. CompA-CLAP significantly improves over all our baseline models on the CompA benchmark, indicating its superior compositional reasoning capabilities.
- [998] arXiv:2310.11622 (replaced) [pdf, html, other]
-
Title: High-Resolution Building and Road Detection from Sentinel-2Wojciech Sirko, Emmanuel Asiedu Brempong, Juliana T. C. Marcos, Abigail Annkah, Abel Korme, Mohammed Alewi Hassen, Krishna Sapkota, Tomer Shekel, Abdoulaye Diack, Sella Nevo, Jason Hickey, John QuinnSubjects: Computer Vision and Pattern Recognition (cs.CV)
Mapping buildings and roads automatically with remote sensing typically requires high-resolution imagery, which is expensive to obtain and often sparsely available. In this work we demonstrate how multiple 10 m resolution Sentinel-2 images can be used to generate 50 cm resolution building and road segmentation masks. This is done by training a `student' model with access to Sentinel-2 images to reproduce the predictions of a `teacher' model which has access to corresponding high-resolution imagery. While the predictions do not have all the fine detail of the teacher model, we find that we are able to retain much of the performance: for building segmentation we achieve 78.3% mIoU, compared to the high-resolution teacher model accuracy of 85.3% mIoU. We also describe a related method for counting individual buildings in a Sentinel-2 patch which achieves R^2 = 0.91 against true counts. This work opens up new possibilities for using freely available Sentinel-2 imagery for a range of tasks that previously could only be done with high-resolution satellite imagery.
- [999] arXiv:2310.13120 (replaced) [pdf, html, other]
-
Title: RSAdapter: Adapting Multimodal Models for Remote Sensing Visual Question AnsweringJournal-ref: IEEE Transactions on Geoscience and Remote Sensing, vol. 62, pp. 1-13, 2024Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
In recent years, with the rapid advancement of transformer models, transformer-based multimodal architectures have found wide application in various downstream tasks, including but not limited to Image Captioning, Visual Question Answering (VQA), and Image-Text Generation. However, contemporary approaches to Remote Sensing (RS) VQA often involve resource-intensive techniques, such as full fine-tuning of large models or the extraction of image-text features from pre-trained multimodal models, followed by modality fusion using decoders. These approaches demand significant computational resources and time, and a considerable number of trainable parameters are introduced. To address these challenges, we introduce a novel method known as RSAdapter, which prioritizes runtime and parameter efficiency. RSAdapter comprises two key components: the Parallel Adapter and an additional linear transformation layer inserted after each fully connected (FC) layer within the Adapter. This approach not only improves adaptation to pre-trained multimodal models but also allows the parameters of the linear transformation layer to be integrated into the preceding FC layers during inference, reducing inference costs. To demonstrate the effectiveness of RSAdapter, we conduct an extensive series of experiments using three distinct RS-VQA datasets and achieve state-of-the-art results on all three datasets. The code for RSAdapter is available online at this https URL.
- [1000] arXiv:2310.14633 (replaced) [pdf, html, other]
-
Title: Extending Input Contexts of Language Models through Training on Segmented SequencesComments: 11 pages, 3 figuresSubjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Effectively training language models on long inputs poses many technical challenges. As a cost consideration, languages models are pretrained on a fixed sequence length before being adapted to longer sequences. We explore various methods for adapting models to longer inputs by training on segmented sequences and an interpolation-based method for extending absolute positional embeddings. We develop a training procedure to extend the input context size of pretrained models with no architectural changes and no additional memory costs than training on the original input lengths. By sub-sampling segments from long inputs while maintaining their original position the model is able to learn new positional interactions. Our method benefits both models trained with absolute positional embeddings, by extending their input contexts, as well as popular relative positional embedding methods showing a reduced perplexity on sequences longer than they were trained on. We demonstrate our method can extend input contexts by a factor of 4x while improving perplexity.
- [1001] arXiv:2310.15903 (replaced) [pdf, html, other]
-
Title: Neural Collapse in Multi-label Learning with Pick-all-label LossSubjects: Machine Learning (cs.LG)
We study deep neural networks for the multi-label classification (MLab) task through the lens of neural collapse (NC). Previous works have been restricted to the multi-class classification setting and discovered a prevalent NC phenomenon comprising of the following properties for the last-layer features: (i) the variability of features within every class collapses to zero, (ii) the set of feature means form an equi-angular tight frame (ETF), and (iii) the last layer classifiers collapse to the feature mean upon some scaling. We generalize the study to multi-label learning, and prove for the first time that a generalized NC phenomenon holds with the "pick-all-label" formulation, which we term as MLab NC. While the ETF geometry remains consistent for features with a single label, multi-label scenarios introduce a unique combinatorial aspect we term the "tag-wise average" property, where the means of features with multiple labels are the scaled averages of means for single-label instances. Theoretically, under proper assumptions on the features, we establish that the only global optimizer of the pick-all-label cross-entropy loss satisfy the multi-label NC. In practice, we demonstrate that our findings can lead to better test performance with more efficient training techniques for MLab learning.
- [1002] arXiv:2310.16113 (replaced) [pdf, other]
-
Title: Compressed representation of brain genetic transcriptionJames K Ruffle, Henry Watkins, Robert J Gray, Harpreet Hyare, Michel Thiebaut de Schotten, Parashkev NachevComments: 22 pages, 5 main figures, 1 supplementary figureSubjects: Machine Learning (cs.LG); Genomics (q-bio.GN); Neurons and Cognition (q-bio.NC)
The architecture of the brain is too complex to be intuitively surveyable without the use of compressed representations that project its variation into a compact, navigable space. The task is especially challenging with high-dimensional data, such as gene expression, where the joint complexity of anatomical and transcriptional patterns demands maximum compression. Established practice is to use standard principal component analysis (PCA), whose computational felicity is offset by limited expressivity, especially at great compression ratios. Employing whole-brain, voxel-wise Allen Brain Atlas transcription data, here we systematically compare compressed representations based on the most widely supported linear and non-linear methods-PCA, kernel PCA, non-negative matrix factorization (NMF), t-stochastic neighbour embedding (t-SNE), uniform manifold approximation and projection (UMAP), and deep auto-encoding-quantifying reconstruction fidelity, anatomical coherence, and predictive utility with respect to signalling, microstructural, and metabolic targets. We show that deep auto-encoders yield superior representations across all metrics of performance and target domains, supporting their use as the reference standard for representing transcription patterns in the human brain.
- [1003] arXiv:2310.19064 (replaced) [pdf, html, other]
-
Title: Apple Tasting: Combinatorial Dimensions and Minimax RatesComments: 21 pages, COLT 2024 Camera ReadySubjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
In online binary classification under \emph{apple tasting} feedback, the learner only observes the true label if it predicts ``1". First studied by \cite{helmbold2000apple}, we revisit this classical partial-feedback setting and study online learnability from a combinatorial perspective. We show that the Littlestone dimension continues to provide a tight quantitative characterization of apple tasting in the agnostic setting, closing an open question posed by \cite{helmbold2000apple}. In addition, we give a new combinatorial parameter, called the Effective width, that tightly quantifies the minimax expected mistakes in the realizable setting. As a corollary, we use the Effective width to establish a \emph{trichotomy} of the minimax expected number of mistakes in the realizable setting. In particular, we show that in the realizable setting, the expected number of mistakes of any learner, under apple tasting feedback, can be $\Theta(1), \Theta(\sqrt{T})$, or $\Theta(T)$. This is in contrast to the full-information realizable setting where only $\Theta(1)$ and $\Theta(T)$ are possible.
- [1004] arXiv:2311.00433 (replaced) [pdf, other]
-
Title: Decentralized PI-control and Anti-windup in Resource Sharing NetworksSubjects: Systems and Control (eess.SY)
We consider control of multiple stable first-order \review{agents} which have a control coupling described by an M-matrix. These agents are subject to incremental sector-bounded \review{input} nonlinearities. We show that such plants can be globally asymptotically stabilized to a unique equilibrium using fully decentralized proportional-integral controllers equipped with anti-windup and subject to local tuning rules. In addition, we show that when the nonlinearities correspond to the saturation function, the closed loop asymptotically minimizes a weighted 1-norm of the agents state mismatch. The control strategy is finally compared to other state-of-the-art controllers on a numerical district heating example.
- [1005] arXiv:2311.01264 (replaced) [pdf, html, other]
-
Title: Structure preserving discontinuous Galerkin approximation of a hyperbolic-parabolic systemSubjects: Numerical Analysis (math.NA)
We study the numerical approximation of a coupled hyperbolic-parabolic system by a family of discontinuous Galerkin space-time finite element methods. The model is rewritten as a first-order evolutionary problem that is treated by the unified abstract solution theory of R. Picard. For the discretization in space, generalizations of the distribution gradient and divergence operators on broken polynomial spaces are defined. Since their skew-selfadjointness is perturbed by boundary surface integrals, adjustments are introduced such that the skew-selfadjointness of the first-order differential operator in space is recovered. Well-posedness of the fully discrete problem and error estimates for the discontinuous Galerkin approximation in space and time are proved.
- [1006] arXiv:2311.03488 (replaced) [pdf, html, other]
-
Title: Multi-Resolution Diffusion for Privacy-Sensitive Recommender SystemsComments: 13 pages, 3 figuresSubjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
While recommender systems have become an integral component of the Web experience, their heavy reliance on user data raises privacy and security concerns. Substituting user data with synthetic data can address these concerns, but accurately replicating these real-world datasets has been a notoriously challenging problem. Recent advancements in generative AI have demonstrated the impressive capabilities of diffusion models in generating realistic data across various domains. In this work we introduce a Score-based Diffusion Recommendation Module (SDRM), which captures the intricate patterns of real-world datasets required for training highly accurate recommender systems. SDRM allows for the generation of synthetic data that can replace existing datasets to preserve user privacy, or augment existing datasets to address excessive data sparsity. Our method outperforms competing baselines such as generative adversarial networks, variational autoencoders, and recently proposed diffusion models in synthesizing various datasets to replace or augment the original data by an average improvement of 4.30% in Recall@k and 4.65% in NDCG@k.
- [1007] arXiv:2311.06436 (replaced) [pdf, other]
-
Title: Augmented Degree Correction for Bipartite Networks with Applications to Recommender SystemsComments: 21 pages, 4 figuresJournal-ref: Appl Netw Sci 9, 19 (2024)Subjects: Social and Information Networks (cs.SI); Applications (stat.AP)
In recommender systems, users rate items, and are subsequently served other product recommendations based on these ratings. Even though users usually rate a tiny percentage of the available items, the system tries to estimate unobserved preferences by finding similarities across users and across items. In this work, we treat the observed ratings data as partially observed, dense, weighted, bipartite networks. For a class of systems without outside information, we adapt an approach developed for dense, weighted networks to account for unobserved edges and the bipartite nature of the problem. This approach allows for community structure, and for local estimation of flexible patterns of ratings across different pairs of communities. We compare the performance of our proposed approach to existing methods on a simulated data set, as well as on a data set of joke ratings, examining model performance in both cases at differing levels of sparsity.
- [1008] arXiv:2311.06530 (replaced) [pdf, html, other]
-
Title: Exploring ChatGPT's Capabilities on Vulnerability ManagementPeiyu Liu, Junming Liu, Lirong Fu, Kangjie Lu, Yifan Xia, Xuhong Zhang, Wenzhi Chen, Haiqin Weng, Shouling Ji, Wenhai WangComments: Accepted by USENIX Security 2024Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Cryptography and Security (cs.CR)
Recently, ChatGPT has attracted great attention from the code analysis domain. Prior works show that ChatGPT has the capabilities of processing foundational code analysis tasks, such as abstract syntax tree generation, which indicates the potential of using ChatGPT to comprehend code syntax and static behaviors. However, it is unclear whether ChatGPT can complete more complicated real-world vulnerability management tasks, such as the prediction of security relevance and patch correctness, which require an all-encompassing understanding of various aspects, including code syntax, program semantics, and related manual comments.
In this paper, we explore ChatGPT's capabilities on 6 tasks involving the complete vulnerability management process with a large-scale dataset containing 70,346 samples. For each task, we compare ChatGPT against SOTA approaches, investigate the impact of different prompts, and explore the difficulties. The results suggest promising potential in leveraging ChatGPT to assist vulnerability management. One notable example is ChatGPT's proficiency in tasks like generating titles for software bug reports. Furthermore, our findings reveal the difficulties encountered by ChatGPT and shed light on promising future directions. For instance, directly providing random demonstration examples in the prompt cannot consistently guarantee good performance in vulnerability management. By contrast, leveraging ChatGPT in a self-heuristic way -- extracting expertise from demonstration examples itself and integrating the extracted expertise in the prompt is a promising research direction. Besides, ChatGPT may misunderstand and misuse the information in the prompt. Consequently, effectively guiding ChatGPT to focus on helpful information rather than the irrelevant content is still an open problem. - [1009] arXiv:2311.06942 (replaced) [pdf, other]
-
Title: Contractive Systems Improve Graph Neural Networks Against Adversarial AttacksSubjects: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
Graph Neural Networks (GNNs) have established themselves as a key component in addressing diverse graph-based tasks. Despite their notable successes, GNNs remain susceptible to input perturbations in the form of adversarial attacks. This paper introduces an innovative approach to fortify GNNs against adversarial perturbations through the lens of contractive dynamical systems. Our method introduces graph neural layers based on differential equations with contractive properties, which, as we show, improve the robustness of GNNs. A distinctive feature of the proposed approach is the simultaneous learned evolution of both the node features and the adjacency matrix, yielding an intrinsic enhancement of model robustness to perturbations in the input features and the connectivity of the graph. We mathematically derive the underpinnings of our novel architecture and provide theoretical insights to reason about its expected behavior. We demonstrate the efficacy of our method through numerous real-world benchmarks, reading on par or improved performance compared to existing methods.
- [1010] arXiv:2311.07142 (replaced) [pdf, html, other]
-
Title: Numerical integrator for highly oscillatory differential equations based on the Neumann seriesComments: 19 pages, 6 figuresJournal-ref: Numerical Algorithms, 2024Subjects: Numerical Analysis (math.NA)
We propose a third-order numerical integrator based on the Neumann series and the Filon quadrature, designed mainly for highly oscillatory partial differential equations. The method can be applied to equations that exhibit small or moderate oscillations; however, counter-intuitively, large oscillations increase the accuracy of the scheme. With the proposed approach, the convergence order of the method can be easily improved. Error analysis of the method is also performed. We consider linear evolution equations involving first- and second-time derivatives that feature elliptic differential operators, such as the heat equation or the wave equation. Numerical experiments consider the case in which the space dimension is greater than one and confirm the theoretical study.
- [1011] arXiv:2311.07434 (replaced) [pdf, html, other]
-
Title: Understanding Users' Dissatisfaction with ChatGPT Responses: Types, Resolving Tactics, and the Effect of Knowledge LevelComments: Accepted IUI 2024Subjects: Human-Computer Interaction (cs.HC)
Large language models (LLMs) with chat-based capabilities, such as ChatGPT, are widely used in various workflows. However, due to a limited understanding of these large-scale models, users struggle to use this technology and experience different kinds of dissatisfaction. Researchers have introduced several methods, such as prompt engineering, to improve model responses. However, they focus on enhancing the model's performance in specific tasks, and little has been investigated on how to deal with the user dissatisfaction resulting from the model's responses. Therefore, with ChatGPT as the case study, we examine users' dissatisfaction along with their strategies to address the dissatisfaction. After organizing users' dissatisfaction with LLM into seven categories based on a literature review, we collected 511 instances of dissatisfactory ChatGPT responses from 107 users and their detailed recollections of dissatisfactory experiences, which we released as a publicly accessible dataset. Our analysis reveals that users most frequently experience dissatisfaction when ChatGPT fails to grasp their intentions, while they rate the severity of dissatisfaction related to accuracy the highest. We also identified four tactics users employ to address their dissatisfaction and their effectiveness. We found that users often do not use any tactics to address their dissatisfaction, and even when using tactics, 72% of dissatisfaction remained unresolved. Moreover, we found that users with low knowledge of LLMs tend to face more dissatisfaction on accuracy while they often put minimal effort in addressing dissatisfaction. Based on these findings, we propose design implications for minimizing user dissatisfaction and enhancing the usability of chat-based LLM.
- [1012] arXiv:2311.08097 (replaced) [pdf, html, other]
-
Title: A Tree-of-Thoughts to Broaden Multi-step Reasoning across LanguagesComments: Findings of the Association for Computational Linguistics: NAACL 2024Journal-ref: 2024.findings-naacl.78Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Reasoning methods, best exemplified by the well-known Chain-of-Thought (CoT), empower the reasoning abilities of Large Language Models (LLMs) by eliciting them to solve complex tasks in a step-by-step manner. Although they are achieving significant success, the ability to deliver multi-step reasoning remains limited to English because of the imbalance in the distribution of pre-training data, which makes other languages a barrier. In this paper, we propose Cross-lingual Tree-of-Thoughts (Cross-ToT), a method for aligning Cross-lingual CoT reasoning across languages. The proposed method, through a self-consistent cross-lingual prompting mechanism inspired by the Tree-of-Thoughts approach, provides multi-step reasoning paths in different languages that, during the steps, lead to the final solution. Experimental evaluations show that our method significantly outperforms existing prompting methods by reducing the number of interactions and achieving state-of-the-art performance.
- [1013] arXiv:2311.08757 (replaced) [pdf, other]
-
Title: A Scalable Two-Level Domain Decomposition Eigensolver for Periodic Schr\"odinger Eigenstates in Anisotropically Expanding DomainsComments: 26 pages, 7 figures, 2 tablesSubjects: Numerical Analysis (math.NA); Computational Engineering, Finance, and Science (cs.CE)
Accelerating iterative eigenvalue algorithms is often achieved by employing a spectral shifting strategy. Unfortunately, improved shifting typically leads to a smaller eigenvalue for the resulting shifted operator, which in turn results in a high condition number of the underlying solution matrix, posing a major challenge for iterative linear solvers. This paper introduces a two-level domain decomposition preconditioner that addresses this issue for the linear Schrödinger eigenvalue problem, even in the presence of a vanishing eigenvalue gap in non-uniform, expanding domains. Since the quasi-optimal shift, which is already available as the solution to a spectral cell problem, is required for the eigenvalue solver, it is logical to also use its associated eigenfunction as a generator to construct a coarse space. We analyze the resulting two-level additive Schwarz preconditioner and obtain a condition number bound that is independent of the domain's anisotropy, despite the need for only one basis function per subdomain for the coarse solver. Several numerical examples are presented to illustrate its flexibility and efficiency.
- [1014] arXiv:2311.09086 (replaced) [pdf, html, other]
-
Title: The Uli Dataset: An Exercise in Experience Led Annotation of oGBVArnav Arora, Maha Jinadoss, Cheshta Arora, Denny George, Brindaalakshmi, Haseena Dawood Khan, Kirti Rawat, Div, Ritash, Seema Mathur, Shivani Yadav, Shehla Rashid Shora, Rie Raut, Sumit Pawar, Apurva Paithane, Sonia, Vivek, Dharini Priscilla, Khairunnisha, Grace Banu, Ambika Tandon, Rishav Thakker, Rahul Dev Korra, Aatman Vaidya, Tarunima PrabhakarSubjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Social and Information Networks (cs.SI)
Online gender based violence has grown concomitantly with adoption of the internet and social media. Its effects are worse in the Global majority where many users use social media in languages other than English. The scale and volume of conversations on the internet has necessitated the need for automated detection of hate speech, and more specifically gendered abuse. There is, however, a lack of language specific and contextual data to build such automated tools. In this paper we present a dataset on gendered abuse in three languages- Hindi, Tamil and Indian English. The dataset comprises of tweets annotated along three questions pertaining to the experience of gender abuse, by experts who identify as women or a member of the LGBTQIA community in South Asia. Through this dataset we demonstrate a participatory approach to creating datasets that drive AI systems.
- [1015] arXiv:2311.09105 (replaced) [pdf, html, other]
-
Title: MAVEN-Arg: Completing the Puzzle of All-in-One Event Understanding Dataset with Event Argument AnnotationXiaozhi Wang, Hao Peng, Yong Guan, Kaisheng Zeng, Jianhui Chen, Lei Hou, Xu Han, Yankai Lin, Zhiyuan Liu, Ruobing Xie, Jie Zhou, Juanzi LiComments: Accepted at ACL 2024. Camera-ready versionSubjects: Computation and Language (cs.CL)
Understanding events in texts is a core objective of natural language understanding, which requires detecting event occurrences, extracting event arguments, and analyzing inter-event relationships. However, due to the annotation challenges brought by task complexity, a large-scale dataset covering the full process of event understanding has long been absent. In this paper, we introduce MAVEN-Arg, which augments MAVEN datasets with event argument annotations, making the first all-in-one dataset supporting event detection, event argument extraction (EAE), and event relation extraction. As an EAE benchmark, MAVEN-Arg offers three main advantages: (1) a comprehensive schema covering 162 event types and 612 argument roles, all with expert-written definitions and examples; (2) a large data scale, containing 98,591 events and 290,613 arguments obtained with laborious human annotation; (3) the exhaustive annotation supporting all task variants of EAE, which annotates both entity and non-entity event arguments in document level. Experiments indicate that MAVEN-Arg is quite challenging for both fine-tuned EAE models and proprietary large language models (LLMs). Furthermore, to demonstrate the benefits of an all-in-one dataset, we preliminarily explore a potential application, future event prediction, with LLMs. MAVEN-Arg and codes can be obtained from this https URL.
- [1016] arXiv:2311.09641 (replaced) [pdf, html, other]
-
Title: RLHFPoison: Reward Poisoning Attack for Reinforcement Learning with Human Feedback in Large Language ModelsSubjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Cryptography and Security (cs.CR); Human-Computer Interaction (cs.HC)
Reinforcement Learning with Human Feedback (RLHF) is a methodology designed to align Large Language Models (LLMs) with human preferences, playing an important role in LLMs alignment. Despite its advantages, RLHF relies on human annotators to rank the text, which can introduce potential security vulnerabilities if any adversarial annotator (i.e., attackers) manipulates the ranking score by up-ranking any malicious text to steer the LLM adversarially. To assess the red-teaming of RLHF against human preference data poisoning, we propose RankPoison, a poisoning attack method on candidates' selection of preference rank flipping to reach certain malicious behaviors (e.g., generating longer sequences, which can increase the computational cost). With poisoned dataset generated by RankPoison, we can perform poisoning attacks on LLMs to generate longer tokens without hurting the original safety alignment performance. Moreover, applying RankPoison, we also successfully implement a backdoor attack where LLMs can generate longer answers under questions with the trigger word. Our findings highlight critical security challenges in RLHF, underscoring the necessity for more robust alignment methods for LLMs.
- [1017] arXiv:2311.13870 (replaced) [pdf, html, other]
-
Title: Multi-intention Inverse Q-learning for Interpretable Behavior RepresentationHao Zhu, Brice De La Crompe, Gabriel Kalweit, Artur Schneider, Maria Kalweit, Ilka Diester, Joschka BoedeckerSubjects: Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC)
In advancing the understanding of natural decision-making processes, inverse reinforcement learning (IRL) methods have proven instrumental in reconstructing animal's intentions underlying complex behaviors. Given the recent development of a continuous-time multi-intention IRL framework, there has been persistent inquiry into inferring discrete time-varying rewards with IRL. To address this challenge, we introduce the class of hierarchical inverse Q-learning (HIQL) algorithms. Through an unsupervised learning process, HIQL divides expert trajectories into multiple intention segments, and solves the IRL problem independently for each. Applying HIQL to simulated experiments and several real animal behavior datasets, our approach outperforms current benchmarks in behavior prediction and produces interpretable reward functions. Our results suggest that the intention transition dynamics underlying complex decision-making behavior is better modeled by a step function instead of a smoothly varying function. This advancement holds promise for neuroscience and cognitive science, contributing to a deeper understanding of decision-making and uncovering underlying brain mechanisms.
- [1018] arXiv:2311.14616 (replaced) [pdf, html, other]
-
Title: Using MultiPrecisonArrays.jl: Iterative Refinement in JuliaSubjects: Numerical Analysis (math.NA)
MultiPrecisionArrays.jl is a Julia package. This package provides data structures and solvers for several variants of iterative refinement. It will become much more useful when half precision (aka Float16) is fully supported in LAPACK/BLAS. For now, its only general-purpose application is classical iterative refinement with double precision equations and single precision factorizations.
- [1019] arXiv:2311.15373 (replaced) [pdf, html, other]
-
Title: Confidence Is All You Need for MI AttacksComments: 2 pages, 1 figureSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
In this evolving era of machine learning security, membership inference attacks have emerged as a potent threat to the confidentiality of sensitive data. In this attack, adversaries aim to determine whether a particular point was used during the training of a target model. This paper proposes a new method to gauge a data point's membership in a model's training set. Instead of correlating loss with membership, as is traditionally done, we have leveraged the fact that training examples generally exhibit higher confidence values when classified into their actual class. During training, the model is essentially being 'fit' to the training data and might face particular difficulties in generalization to unseen data. This asymmetry leads to the model achieving higher confidence on the training data as it exploits the specific patterns and noise present in the training data. Our proposed approach leverages the confidence values generated by the machine learning model. These confidence values provide a probabilistic measure of the model's certainty in its predictions and can further be used to infer the membership of a given data point. Additionally, we also introduce another variant of our method that allows us to carry out this attack without knowing the ground truth(true class) of a given data point, thus offering an edge over existing label-dependent attack methods.
- [1020] arXiv:2311.16421 (replaced) [pdf, html, other]
-
Title: CDEval: A Benchmark for Measuring the Cultural Dimensions of Large Language ModelsComments: Accepted by the Cross-Cultural Considerations in NLP Workshop @ ACL 2024Subjects: Computation and Language (cs.CL); Computers and Society (cs.CY)
As the scaling of Large Language Models (LLMs) has dramatically enhanced their capabilities, there has been a growing focus on the alignment problem to ensure their responsible and ethical use. While existing alignment efforts predominantly concentrate on universal values such as the HHH principle, the aspect of culture, which is inherently pluralistic and diverse, has not received adequate attention. This work introduces a new benchmark, CDEval, aimed at evaluating the cultural dimensions of LLMs. CDEval is constructed by incorporating both GPT-4's automated generation and human verification, covering six cultural dimensions across seven domains. Our comprehensive experiments provide intriguing insights into the culture of mainstream LLMs, highlighting both consistencies and variations across different dimensions and domains. The findings underscore the importance of integrating cultural considerations in LLM development, particularly for applications in diverse cultural settings. Through CDEval, we aim to broaden the horizon of LLM alignment research by including cultural dimensions, thus providing a more holistic framework for the future development and evaluation of LLMs. This benchmark serves as a valuable resource for cultural studies in LLMs, paving the way for more culturally aware and sensitive models.
- [1021] arXiv:2311.16492 (replaced) [pdf, html, other]
-
Title: VLPrompt: Vision-Language Prompting for Panoptic Scene Graph GenerationComments: 22 pages, 9 figuresSubjects: Computer Vision and Pattern Recognition (cs.CV)
Panoptic Scene Graph Generation (PSG) aims at achieving a comprehensive image understanding by simultaneously segmenting objects and predicting relations among objects. However, the long-tail problem among relations leads to unsatisfactory results in real-world applications. Prior methods predominantly rely on vision information or utilize limited language information, such as object or relation names, thereby overlooking the utility of language information. Leveraging the recent progress in Large Language Models (LLMs), we propose to use language information to assist relation prediction, particularly for rare relations. To this end, we propose the Vision-Language Prompting (VLPrompt) model, which acquires vision information from images and language information from LLMs. Then, through a prompter network based on attention mechanism, it achieves precise relation prediction. Our extensive experiments show that VLPrompt significantly outperforms previous state-of-the-art methods on the PSG dataset, proving the effectiveness of incorporating language information and alleviating the long-tail problem of relations. Code is available at \url{this https URL}.
- [1022] arXiv:2311.17451 (replaced) [pdf, html, other]
-
Title: Wireless Network Digital Twin for 6G: Generative AI as A Key EnablerComments: This article has been accepted by IEEE Wireless Communication (Mobile AI-Generated Content in 6G Era special issue)Subjects: Networking and Internet Architecture (cs.NI); Machine Learning (cs.LG)
Digital twin, which enables emulation, evaluation, and optimization of physical entities through synchronized digital replicas, has gained increasing attention as a promising technology for intricate wireless networks. For 6G, numerous innovative wireless technologies and network architectures have posed new challenges in establishing wireless network digital twins. To tackle these challenges, artificial intelligence (AI), particularly the flourishing generative AI, emerges as a potential solution. In this article, we discuss emerging prerequisites for wireless network digital twins considering the complicated network architecture, tremendous network scale, extensive coverage, and diversified application scenarios in the 6G era. We further explore the applications of generative AI, such as Transformer and diffusion model, to empower the 6G digital twin from multiple perspectives including physical-digital modeling, synchronization, and slicing capability. Subsequently, we propose a hierarchical generative AI-enabled wireless network digital twin at both the message-level and policy-level, and provide a typical use case with numerical results to validate the effectiveness and efficiency. Finally, open research issues for wireless network digital twins in the 6G era are discussed.
- [1023] arXiv:2311.17539 (replaced) [pdf, html, other]
-
Title: Critical Influence of Overparameterization on Sharpness-aware MinimizationSubjects: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
Training an overparameterized neural network can yield minimizers of different generalization capabilities despite the same level of training loss. Meanwhile, with evidence that suggests a strong correlation between the sharpness of minima and their generalization errors, increasing efforts have been made to develop optimization methods to explicitly find flat minima as more generalizable solutions. Despite its contemporary relevance to overparameterization, however, this sharpness-aware minimization (SAM) strategy has not been studied much yet as to exactly how it is affected by overparameterization. Hence, in this work, we analyze SAM under overparameterization of varying degrees and present both empirical and theoretical results that indicate a critical influence of overparameterization on SAM. At first, we conduct extensive numerical experiments across vision, language, graph, and reinforcement learning domains and show that SAM consistently improves with overparameterization. Next, we attribute this phenomenon to the interplay between the enlarged solution space and increased implicit bias from overparameterization. Further, we prove multiple theoretical benefits of overparameterization for SAM to attain (i) minima with more uniform Hessian moments compared to SGD, (ii) much faster convergence at a linear rate, and (iii) lower test error for two-layer networks. Last but not least, we discover that the effect of overparameterization is more significantly pronounced in practical settings of label noise and sparsity, and yet, sufficient regularization is necessary.
- [1024] arXiv:2311.17541 (replaced) [pdf, html, other]
-
Title: TaskWeaver: A Code-First Agent FrameworkBo Qiao, Liqun Li, Xu Zhang, Shilin He, Yu Kang, Chaoyun Zhang, Fangkai Yang, Hang Dong, Jue Zhang, Lu Wang, Minghua Ma, Pu Zhao, Si Qin, Xiaoting Qin, Chao Du, Yong Xu, Qingwei Lin, Saravan Rajmohan, Dongmei ZhangSubjects: Artificial Intelligence (cs.AI)
Large Language Models (LLMs) have shown impressive abilities in natural language understanding and generation, leading to their widespread use in applications such as chatbots and virtual assistants. However, existing LLM frameworks face limitations in handling domain-specific data analytics tasks with rich data structures. Moreover, they struggle with flexibility to meet diverse user requirements. To address these issues, TaskWeaver is proposed as a code-first framework for building LLM-powered autonomous agents. It converts user requests into executable code and treats user-defined plugins as callable functions. TaskWeaver provides support for rich data structures, flexible plugin usage, and dynamic plugin selection, and leverages LLM coding capabilities for complex logic. It also incorporates domain-specific knowledge through examples and ensures the secure execution of generated code. TaskWeaver offers a powerful and flexible framework for creating intelligent conversational agents that can handle complex tasks and adapt to domain-specific scenarios. The code is open sourced at this https URL.
- [1025] arXiv:2311.17600 (replaced) [pdf, other]
-
Title: MM-SafetyBench: A Benchmark for Safety Evaluation of Multimodal Large Language ModelsSubjects: Computer Vision and Pattern Recognition (cs.CV)
The security concerns surrounding Large Language Models (LLMs) have been extensively explored, yet the safety of Multimodal Large Language Models (MLLMs) remains understudied. In this paper, we observe that Multimodal Large Language Models (MLLMs) can be easily compromised by query-relevant images, as if the text query itself were malicious. To address this, we introduce MM-SafetyBench, a comprehensive framework designed for conducting safety-critical evaluations of MLLMs against such image-based manipulations. We have compiled a dataset comprising 13 scenarios, resulting in a total of 5,040 text-image pairs. Our analysis across 12 state-of-the-art models reveals that MLLMs are susceptible to breaches instigated by our approach, even when the equipped LLMs have been safety-aligned. In response, we propose a straightforward yet effective prompting strategy to enhance the resilience of MLLMs against these types of attacks. Our work underscores the need for a concerted effort to strengthen and enhance the safety measures of open-source MLLMs against potential malicious exploits. The resource is available at this https URL
- [1026] arXiv:2312.01541 (replaced) [pdf, html, other]
-
Title: Revisiting Non-separable Binary Classification and its Applications in Anomaly DetectionComments: Accepted in Transactions on Machine Learning Research (TMLR) 2024. Code: this https URLSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
The inability to linearly classify XOR has motivated much of deep learning. We revisit this age-old problem and show that linear classification of XOR is indeed possible. Instead of separating data between halfspaces, we propose a slightly different paradigm, equality separation, that adapts the SVM objective to distinguish data within or outside the margin. Our classifier can then be integrated into neural network pipelines with a smooth approximation. From its properties, we intuit that equality separation is suitable for anomaly detection. To formalize this notion, we introduce closing numbers, a quantitative measure on the capacity for classifiers to form closed decision regions for anomaly detection. Springboarding from this theoretical connection between binary classification and anomaly detection, we test our hypothesis on supervised anomaly detection experiments, showing that equality separation can detect both seen and unseen anomalies.
- [1027] arXiv:2312.02592 (replaced) [pdf, html, other]
-
Title: FRAPPE: A Group Fairness Framework for Post-Processing EverythingComments: Conference paper at ICML 2024Subjects: Machine Learning (cs.LG); Computers and Society (cs.CY)
Despite achieving promising fairness-error trade-offs, in-processing mitigation techniques for group fairness cannot be employed in numerous practical applications with limited computation resources or no access to the training pipeline of the prediction model. In these situations, post-processing is a viable alternative. However, current methods are tailored to specific problem settings and fairness definitions and hence, are not as broadly applicable as in-processing. In this work, we propose a framework that turns any regularized in-processing method into a post-processing approach. This procedure prescribes a way to obtain post-processing techniques for a much broader range of problem settings than the prior post-processing literature. We show theoretically and through extensive experiments that our framework preserves the good fairness-error trade-offs achieved with in-processing and can improve over the effectiveness of prior post-processing methods. Finally, we demonstrate several advantages of a modular mitigation strategy that disentangles the training of the prediction model from the fairness mitigation, including better performance on tasks with partial group labels.
- [1028] arXiv:2312.04772 (replaced) [pdf, html, other]
-
Title: Remembering to Be Fair: Non-Markovian Fairness in Sequential Decision MakingSubjects: Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)
Fair decision making has largely been studied with respect to a single decision. Here we investigate the notion of fairness in the context of sequential decision making where multiple stakeholders can be affected by the outcomes of decisions. We observe that fairness often depends on the history of the sequential decision-making process, and in this sense that it is inherently non-Markovian. We further observe that fairness often needs to be assessed at time points within the process, not just at the end of the process. To advance our understanding of this class of fairness problems, we explore the notion of non-Markovian fairness in the context of sequential decision making. We identify properties of non-Markovian fairness, including notions of long-term, anytime, periodic, and bounded fairness. We explore the interplay between non-Markovian fairness and memory and how memory can support construction of fair policies. Finally, we introduce the FairQCM algorithm, which can automatically augment its training data to improve sample efficiency in the synthesis of fair policies via reinforcement learning.
- [1029] arXiv:2312.05547 (replaced) [pdf, html, other]
-
Title: Signatures Meet Dynamic Programming: Generalizing Bellman Equations for Trajectory FollowingComments: 48 pages, 21 figuresJournal-ref: 6th Annual Conference on Learning for Dynamics and Control (2024)Subjects: Systems and Control (eess.SY); Machine Learning (cs.LG); Robotics (cs.RO)
Path signatures have been proposed as a powerful representation of paths that efficiently captures the path's analytic and geometric characteristics, having useful algebraic properties including fast concatenation of paths through tensor products. Signatures have recently been widely adopted in machine learning problems for time series analysis. In this work we establish connections between value functions typically used in optimal control and intriguing properties of path signatures. These connections motivate our novel control framework with signature transforms that efficiently generalizes the Bellman equation to the space of trajectories. We analyze the properties and advantages of the framework, termed signature control. In particular, we demonstrate that (i) it can naturally deal with varying/adaptive time steps; (ii) it propagates higher-level information more efficiently than value function updates; (iii) it is robust to dynamical system misspecification over long rollouts. As a specific case of our framework, we devise a model predictive control method for path tracking. This method generalizes integral control, being suitable for problems with unknown disturbances. The proposed algorithms are tested in simulation, with differentiable physics models including typical control and robotics tasks such as point-mass, curve following for an ant model, and a robotic manipulator.
- [1030] arXiv:2312.06071 (replaced) [pdf, html, other]
-
Title: Precipitation Downscaling with Spatiotemporal Video DiffusionPrakhar Srivastava, Ruihan Yang, Gavin Kerrigan, Gideon Dresdner, Jeremy McGibbon, Christopher Bretherton, Stephan MandtSubjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Atmospheric and Oceanic Physics (physics.ao-ph); Machine Learning (stat.ML)
In climate science and meteorology, high-resolution local precipitation (rain and snowfall) predictions are limited by the computational costs of simulation-based methods. Statistical downscaling, or super-resolution, is a common workaround where a low-resolution prediction is improved using statistical approaches. Unlike traditional computer vision tasks, weather and climate applications require capturing the accurate conditional distribution of high-resolution given low-resolution patterns to assure reliable ensemble averages and unbiased estimates of extreme events, such as heavy rain. This work extends recent video diffusion models to precipitation super-resolution, employing a deterministic downscaler followed by a temporally-conditioned diffusion model to capture noise characteristics and high-frequency patterns. We test our approach on FV3GFS output, an established large-scale global atmosphere model, and compare it against six state-of-the-art baselines. Our analysis, capturing CRPS, MSE, precipitation distributions, and qualitative aspects using California and the Himalayas as examples, establishes our method as a new standard for data-driven precipitation downscaling.
- [1031] arXiv:2312.06716 (replaced) [pdf, html, other]
-
Title: Deciphering 'What' and 'Where' Visual Pathways from Spectral Clustering of Layer-Distributed Neural RepresentationsComments: Accepted to CVPR2024 (Highlight)Subjects: Computer Vision and Pattern Recognition (cs.CV)
We present an approach for analyzing grouping information contained within a neural network's activations, permitting extraction of spatial layout and semantic segmentation from the behavior of large pre-trained vision models. Unlike prior work, our method conducts a holistic analysis of a network's activation state, leveraging features from all layers and obviating the need to guess which part of the model contains relevant information. Motivated by classic spectral clustering, we formulate this analysis in terms of an optimization objective involving a set of affinity matrices, each formed by comparing features within a different layer. Solving this optimization problem using gradient descent allows our technique to scale from single images to dataset-level analysis, including, in the latter, both intra- and inter-image relationships. Analyzing a pre-trained generative transformer provides insight into the computational strategy learned by such models. Equating affinity with key-query similarity across attention layers yields eigenvectors encoding scene spatial layout, whereas defining affinity by value vector similarity yields eigenvectors encoding object identity. This result suggests that key and query vectors coordinate attentional information flow according to spatial proximity (a `where' pathway), while value vectors refine a semantic category representation (a `what' pathway).
- [1032] arXiv:2312.08937 (replaced) [pdf, html, other]
-
Title: BiPFT: Binary Pre-trained Foundation Transformer with Low-rank Estimation of Binarization Residual PolynomialsJournal-ref: Proceedings of the AAAI Conference on Artificial Intelligence. 2024, 38(14): 16094-16102Subjects: Machine Learning (cs.LG)
Pretrained foundation models offer substantial benefits for a wide range of downstream tasks, which can be one of the most potential techniques to access artificial general intelligence. However, scaling up foundation transformers for maximal task-agnostic knowledge has brought about computational challenges, especially on resource-limited devices such as mobiles. This work proposes the first Binary Pretrained Foundation Transformer (BiPFT) for natural language understanding (NLU) tasks, which remarkably saves 56 times operations and 28 times memory. In contrast to previous task-specific binary transformers, BiPFT exhibits a substantial enhancement in the learning capabilities of binary neural networks (BNNs), promoting BNNs into the era of pre-training. Benefiting from extensive pretraining data, we further propose a data-driven binarization method. Specifically, we first analyze the binarization error in self-attention operations and derive the polynomials of binarization error. To simulate full-precision self-attention, we define binarization error as binarization residual polynomials, and then introduce low-rank estimators to model these polynomials. Extensive experiments validate the effectiveness of BiPFTs, surpassing task-specific baseline by 15.4% average performance on the GLUE benchmark. BiPFT also demonstrates improved robustness to hyperparameter changes, improved optimization efficiency, and reduced reliance on downstream distillation, which consequently generalize on various NLU tasks and simplify the downstream pipeline of BNNs. Our code and pretrained models are publicly available at this https URL.
- [1033] arXiv:2312.10321 (replaced) [pdf, html, other]
-
Title: LLM-SQL-Solver: Can LLMs Determine SQL Equivalence?Subjects: Databases (cs.DB); Computation and Language (cs.CL)
Judging the equivalence between two SQL queries is a fundamental problem with many practical applications in data management and SQL generation (i.e., evaluating the quality of generated SQL queries in text-to-SQL task). While the research community has reasoned about SQL equivalence for decades, it poses considerable difficulties and no complete solutions exist. Recently, Large Language Models (LLMs) have shown strong reasoning capability in conversation, question answering and solving mathematics challenges. In this paper, we study if LLMs can be used to determine the equivalence between SQL queries under two notions of SQL equivalence (semantic equivalence and relaxed equivalence). To assist LLMs in generating high quality responses, we present two prompting techniques: Miniature & Mull and Explain & Compare. The former technique is used to evaluate the semantic equivalence in which it asks LLMs to execute a query on a simple database instance and then explore if a counterexample exists by modifying the database. The latter technique is used to evaluate the relaxed equivalence in which it asks LLMs to explain the queries and then compare if they contain significant logical differences. Our experiments demonstrate using our techniques, LLMs is a promising tool to help data engineers in writing semantically equivalent SQL queries, however challenges still persist, and is a better metric for evaluating SQL generation than the popular execution accuracy.
- [1034] arXiv:2312.11194 (replaced) [pdf, html, other]
-
Title: Aligning Human Intent from Imperfect Demonstrations with Confidence-based Inverse soft-Q LearningComments: Our code see this https URLSubjects: Robotics (cs.RO)
Imitation learning attracts much attention for its ability to allow robots to quickly learn human manipulation skills through demonstrations. However, in the real world, human demonstrations often exhibit random behavior that is not intended by humans. Collecting high-quality human datasets is both challenging and expensive. Consequently, robots need to have the ability to learn behavioral policies that align with human intent from imperfect demonstrations. Previous work uses confidence scores to extract useful information from imperfect demonstrations, which relies on access to ground truth rewards or active human supervision. In this paper, we propose a transition-based method to obtain fine-grained confidence scores for data without the above efforts, which can increase the success rate of the baseline algorithm by 40.3$\%$ on average. We develop a generalized confidence-based imitation learning framework for guiding policy learning, called Confidence-based Inverse soft-Q Learning (CIQL), as shown in Fig.1. Based on this, we analyze two ways of processing noise and find that penalization is more aligned with human intent than filtering.
- [1035] arXiv:2312.11560 (replaced) [pdf, html, other]
-
Title: Learning from Emergence: A Study on Proactively Inhibiting the Monosemantic Neurons of Artificial Neural NetworksComments: 16 pages, 5 figures, KDD2024Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
Recently, emergence has received widespread attention from the research community along with the success of large-scale models. Different from the literature, we hypothesize a key factor that promotes the performance during the increase of scale: the reduction of monosemantic neurons that can only form one-to-one correlations with specific features. Monosemantic neurons tend to be sparser and have negative impacts on the performance in large models. Inspired by this insight, we propose an intuitive idea to identify monosemantic neurons and inhibit them. However, achieving this goal is a non-trivial task as there is no unified quantitative evaluation metric and simply banning monosemantic neurons does not promote polysemanticity in neural networks. Therefore, we first propose a new metric to measure the monosemanticity of neurons with the guarantee of efficiency for online computation, then introduce a theoretically supported method to suppress monosemantic neurons and proactively promote the ratios of polysemantic neurons in training neural networks. We validate our conjecture that monosemanticity brings about performance change at different model scales on a variety of neural networks and benchmark datasets in different areas, including language, image, and physics simulation tasks. Further experiments validate our analysis and theory regarding the inhibition of monosemanticity.
- [1036] arXiv:2312.12423 (replaced) [pdf, other]
-
Title: Jack of All Tasks, Master of Many: Designing General-purpose Coarse-to-Fine Vision-Language ModelShraman Pramanick, Guangxing Han, Rui Hou, Sayan Nag, Ser-Nam Lim, Nicolas Ballas, Qifan Wang, Rama Chellappa, Amjad AlmahairiComments: CVPR 2024 HighlightSubjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
The ability of large language models (LLMs) to process visual inputs has given rise to general-purpose vision systems, unifying various vision-language (VL) tasks by instruction tuning. However, due to the enormous diversity in input-output formats in the vision domain, existing general-purpose models fail to successfully integrate segmentation and multi-image inputs with coarse-level tasks into a single framework. In this work, we introduce VistaLLM, a powerful visual system that addresses coarse- and fine-grained VL tasks over single and multiple input images using a unified framework. VistaLLM utilizes an instruction-guided image tokenizer that filters global embeddings using task descriptions to extract compressed and refined features from numerous images. Moreover, VistaLLM employs a gradient-aware adaptive sampling technique to represent binary segmentation masks as sequences, significantly improving over previously used uniform sampling. To bolster the desired capability of VistaLLM, we curate CoinIt, a comprehensive coarse-to-fine instruction tuning dataset with 6.8M samples. We also address the lack of multi-image grounding datasets by introducing a novel task, AttCoSeg (Attribute-level Co-Segmentation), which boosts the model's reasoning and grounding capability over multiple input images. Extensive experiments on a wide range of V- and VL tasks demonstrate the effectiveness of VistaLLM by achieving consistent state-of-the-art performance over strong baselines across all downstream tasks. Our project page can be found at this https URL.
- [1037] arXiv:2312.13327 (replaced) [pdf, html, other]
-
Title: In-Context Reinforcement Learning for Variable Action SpacesComments: ICML 2024Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Recently, it has been shown that transformers pre-trained on diverse datasets with multi-episode contexts can generalize to new reinforcement learning tasks in-context. A key limitation of previously proposed models is their reliance on a predefined action space size and structure. The introduction of a new action space often requires data re-collection and model re-training, which can be costly for some applications. In our work, we show that it is possible to mitigate this issue by proposing the Headless-AD model that, despite being trained only once, is capable of generalizing to discrete action spaces of variable size, semantic content and order. By experimenting with Bernoulli and contextual bandits, as well as a gridworld environment, we show that Headless-AD exhibits significant capability to generalize to action spaces it has never encountered, even outperforming specialized models trained for a specific set of actions on several environment configurations. Implementation is available at: this https URL.
- [1038] arXiv:2312.13921 (replaced) [pdf, html, other]
-
Title: Hulls of projective Reed-Muller codes over the projective planeSubjects: Information Theory (cs.IT); Commutative Algebra (math.AC)
By solving a problem regarding polynomials in a quotient ring, we obtain the relative hull and the Hermitian hull of projective Reed-Muller codes over the projective plane. The dimension of the hull determines the minimum number of maximally entangled pairs required for the corresponding entanglement-assisted quantum error-correcting code. Hence, by computing the dimension of the hull we now have all the parameters of the symmetric and asymmetric entanglement-assisted quantum error-correcting codes constructed with projective Reed-Muller codes over the projective plane. As a byproduct, we also compute the dimension of the Hermitian hull for affine Reed-Muller codes in 2 variables.
- [1039] arXiv:2312.14895 (replaced) [pdf, html, other]
-
Title: FAST: Feature Aware Similarity Thresholding for Weak Unlearning in Black-Box Generative ModelsSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
The heightened emphasis on the regulation of deep generative models, propelled by escalating concerns pertaining to privacy and compliance with regulatory frameworks, underscores the imperative need for precise control mechanisms over these models. This urgency is particularly underscored by instances in which generative models generate outputs that encompass objectionable, offensive, or potentially injurious content. In response, machine unlearning has emerged to selectively forget specific knowledge or remove the influence of undesirable data subsets from pre-trained models. However, modern machine unlearning approaches typically assume access to model parameters and architectural details during unlearning, which is not always feasible. In multitude of downstream tasks, these models function as black-box systems, with inaccessible pre-trained parameters, architectures, and training data. In such scenarios, the possibility of filtering undesired outputs becomes a practical alternative. The primary goal of this study is twofold: first, to elucidate the relationship between filtering and unlearning processes, and second, to formulate a methodology aimed at mitigating the display of undesirable outputs generated from models characterized as black-box systems. Theoretical analysis in this study demonstrates that, in the context of black-box models, filtering can be seen as a form of weak unlearning. Our proposed \textbf{\textit{Feature Aware Similarity Thresholding(FAST)}} method effectively suppresses undesired outputs by systematically encoding the representation of unwanted features in the latent space.
- [1040] arXiv:2312.15490 (replaced) [pdf, other]
-
Title: Diffusion-EXR: Controllable Review Generation for Explainable Recommendation via Diffusion ModelsComments: I request to withdraw my paper due to the discovery of significant errors in terms of experimental results in the manuscript that affect the validity of the paper. These errors are necessary to correct, and the current version should not be used or cited in its present formSubjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Denoising Diffusion Probabilistic Model (DDPM) has shown great competence in image and audio generation tasks. However, there exist few attempts to employ DDPM in the text generation, especially review generation under recommendation systems. Fueled by the predicted reviews explainability that justifies recommendations could assist users better understand the recommended items and increase the transparency of recommendation system, we propose a Diffusion Model-based Review Generation towards EXplainable Recommendation named Diffusion-EXR. Diffusion-EXR corrupts the sequence of review embeddings by incrementally introducing varied levels of Gaussian noise to the sequence of word embeddings and learns to reconstruct the original word representations in the reverse process. The nature of DDPM enables our lightweight Transformer backbone to perform excellently in the recommendation review generation task. Extensive experimental results have demonstrated that Diffusion-EXR can achieve state-of-the-art review generation for recommendation on two publicly available benchmark datasets.
- [1041] arXiv:2312.15643 (replaced) [pdf, html, other]
-
Title: Advancing Abductive Reasoning in Knowledge Graphs through Complex Logical Hypothesis GenerationComments: Accepted by ACL 2024Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Abductive reasoning is the process of making educated guesses to provide explanations for observations. Although many applications require the use of knowledge for explanations, the utilization of abductive reasoning in conjunction with structured knowledge, such as a knowledge graph, remains largely unexplored. To fill this gap, this paper introduces the task of complex logical hypothesis generation, as an initial step towards abductive logical reasoning with KG. In this task, we aim to generate a complex logical hypothesis so that it can explain a set of observations. We find that the supervised trained generative model can generate logical hypotheses that are structurally closer to the reference hypothesis. However, when generalized to unseen observations, this training objective does not guarantee better hypothesis generation. To address this, we introduce the Reinforcement Learning from Knowledge Graph (RLF-KG) method, which minimizes differences between observations and conclusions drawn from generated hypotheses according to the KG. Experiments show that, with RLF-KG's assistance, the generated hypotheses provide better explanations, and achieve state-of-the-art results on three widely used KGs.
- [1042] arXiv:2312.15896 (replaced) [pdf, html, other]
-
Title: WWW: What, When, Where to Compute-in-MemoryComments: updated methodologySubjects: Hardware Architecture (cs.AR); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
Compute-in-memory (CiM) has emerged as a highly energy efficient solution for performing matrix multiplication during Machine Learning (ML) inference. However, integrating compute in memory poses key questions, such as 1) What type of CiM to use: Given a multitude of CiM design characteristics, determining their suitability from architecture perspective is needed. 2) When to use CiM: ML inference includes workloads with a variety of memory and compute requirements, making it difficult to identify when CiM is more beneficial. 3) Where to integrate CiM: Each memory level has different bandwidth and capacity, creating different data reuse opportunities for CiM integration.
To answer such questions regarding on-chip CiM integration for accelerating ML workloads, we use an analytical architecture evaluation methodology where we tailor the dataflow mapping. The mapping algorithm aims to achieve highest weight reuse and reduced data movements for a given CiM prototype and workload. Our experiments show that CiM integrated memory improves energy efficiency by up to 3.4x and throughput by up to 15.6x compared to tensor-core-like baseline architecture, with INT-8 precision under iso-area constraints. We believe the proposed work provides insights into what type of CiM to use, and when and where to optimally integrate it in the cache hierarchy for efficient matrix multiplication. - [1043] arXiv:2312.15915 (replaced) [pdf, other]
-
Title: ChartBench: A Benchmark for Complex Visual Reasoning in ChartsSubjects: Computer Vision and Pattern Recognition (cs.CV)
Multimodal Large Language Models (MLLMs) have shown impressive capabilities in image understanding and generation. However, current benchmarks fail to accurately evaluate the chart comprehension of MLLMs due to limited chart types and inappropriate metrics. To address this, we propose ChartBench, a comprehensive benchmark designed to assess chart comprehension and data reliability through complex visual reasoning. ChartBench includes 42 categories, 66.6k charts, and 600k question-answer pairs. Notably, many charts lack data point annotations, which requires MLLMs to derive values similar to human understanding by leveraging inherent chart elements such as color, legends, and coordinate systems. We also design an enhanced evaluation metric, Acc+, to evaluate MLLMs without extensive manual or costly LLM-based evaluations. Furthermore, we propose two baselines based on the chain of thought and supervised fine-tuning to improve model performance on unannotated charts. Extensive experimental evaluations of 18 open-sourced and 3 proprietary MLLMs reveal their limitations in chart comprehension and offer valuable insights for further research. Code and dataset are publicly available at this https URL.
- [1044] arXiv:2312.17007 (replaced) [pdf, other]
-
Title: On the rate of convergence of an over-parametrized Transformer classifier learned by gradient descentSubjects: Machine Learning (cs.LG); Statistics Theory (math.ST); Machine Learning (stat.ML)
One of the most recent and fascinating breakthroughs in artificial intelligence is ChatGPT, a chatbot which can simulate human conversation. ChatGPT is an instance of GPT4, which is a language model based on generative gredictive gransformers. So if one wants to study from a theoretical point of view, how powerful such artificial intelligence can be, one approach is to consider transformer networks and to study which problems one can solve with these networks theoretically. Here it is not only important what kind of models these network can approximate, or how they can generalize their knowledge learned by choosing the best possible approximation to a concrete data set, but also how well optimization of such transformer network based on concrete data set works. In this article we consider all these three different aspects simultaneously and show a theoretical upper bound on the missclassification probability of a transformer network fitted to the observed data. For simplicity we focus in this context on transformer encoder networks which can be applied to define an estimate in the context of a classification problem involving natural language.
- [1045] arXiv:2401.00789 (replaced) [pdf, html, other]
-
Title: Retrieval-Augmented Egocentric Video CaptioningComments: CVPR 2024. Project page is available at: this https URLSubjects: Computer Vision and Pattern Recognition (cs.CV)
Understanding human actions from videos of first-person view poses significant challenges. Most prior approaches explore representation learning on egocentric videos only, while overlooking the potential benefit of exploiting existing large-scale third-person videos. In this paper, (1) we develop EgoInstructor, a retrieval-augmented multimodal captioning model that automatically retrieves semantically relevant third-person instructional videos to enhance the video captioning of egocentric videos. (2) For training the cross-view retrieval module, we devise an automatic pipeline to discover ego-exo video pairs from distinct large-scale egocentric and exocentric datasets. (3) We train the cross-view retrieval module with a novel EgoExoNCE loss that pulls egocentric and exocentric video features closer by aligning them to shared text features that describe similar actions. (4) Through extensive experiments, our cross-view retrieval module demonstrates superior performance across seven benchmarks. Regarding egocentric video captioning, EgoInstructor exhibits significant improvements by leveraging third-person videos as references. Project page is available at: this https URL
- [1046] arXiv:2401.00816 (replaced) [pdf, html, other]
-
Title: GLIMPSE: Generalized Local Imaging with MLPsComments: 12 pages, 10 figuresSubjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
Deep learning is the current de facto state of the art in tomographic imaging. A common approach is to feed the result of a simple inversion, for example the backprojection, to a convolutional neural network (CNN) which then computes the reconstruction. Despite strong results on 'in-distribution' test data similar to the training data, backprojection from sparse-view data delocalizes singularities, so these approaches require a large receptive field to perform well. As a consequence, they overfit to certain global structures which leads to poor generalization on out-of-distribution (OOD) samples. Moreover, their memory complexity and training time scale unfavorably with image resolution, making them impractical for application at realistic clinical resolutions, especially in 3D: a standard U-Net requires a substantial 140GB of memory and 2600 seconds per epoch on a research-grade GPU when training on 1024x1024 images. In this paper, we introduce GLIMPSE, a local processing neural network for computed tomography which reconstructs a pixel value by feeding only the measurements associated with the neighborhood of the pixel to a simple MLP. While achieving comparable or better performance with successful CNNs like the U-Net on in-distribution test data, GLIMPSE significantly outperforms them on OOD samples while maintaining a memory footprint almost independent of image resolution; 5GB memory suffices to train on 1024x1024 images. Further, we built GLIMPSE to be fully differentiable, which enables feats such as recovery of accurate projection angles if they are out of calibration.
- [1047] arXiv:2401.01017 (replaced) [pdf, html, other]
-
Title: A Survey of Computation Offloading with Task TypesComments: Accepted by IEEE Transactions on Intelligent Transportation SystemsSubjects: Distributed, Parallel, and Cluster Computing (cs.DC)
Computation task offloading plays a crucial role in facilitating computation-intensive applications and edge intelligence, particularly in response to the explosive growth of massive data generation. Various enabling techniques, wireless technologies and mechanisms have already been proposed for task offloading, primarily aimed at improving the quality of services (QoS) for users. While there exists an extensive body of literature on this topic, exploring computation offloading from the standpoint of task types has been relatively underrepresented. This motivates our survey, which seeks to classify the state-of-the-art (SoTA) from the task type point-of-view. To achieve this, a thorough literature review is conducted to reveal the SoTA from various aspects, including architecture, objective, offloading strategy, and task types, with the consideration of task generation. It has been observed that task types are associated with data and have an impact on the offloading process, including elements like resource allocation and task assignment. Building upon this insight, computation offloading is categorized into two groups based on task types: static task-based offloading and dynamic task-based offloading. Finally, a prospective view of the challenges and opportunities in the field of future computation offloading is presented.
- [1048] arXiv:2401.01185 (replaced) [pdf, html, other]
-
Title: On the Uniqueness of Bayesian Coarse Correlated Equilibria in Standard First-Price and All-Pay AuctionsComments: 73 pages, 6 figures. A first version of this article appeared on ArXiV on January 2nd/3rd, 2024. This version significantly extends the resultsSubjects: Computer Science and Game Theory (cs.GT)
We study the Bayesian coarse correlated equilibrium (BCCE) of continuous and discretised first-price and all-pay auctions under the standard symmetric independent private-values model. Our goal is to determine how the canonical Bayes-Nash equilibrium (BNE) of the auction relates to the outcome when all buyers bid following no-regret algorithms. Numerical experiments show that in two buyer first-price auctions the Wasserstein-$2$ distance of buyers' marginal bid distributions decline as $O(1/n)$ in the discretisation size in instances where the prior distribution is concave, whereas all-pay auctions exhibit similar behaviour without prior dependence. To explain this convergence to a near-equilibrium, we study uniqueness of the BCCE of the continuous auction. Our uniqueness results translate to provable convergence of deterministic self-play to a near equilibrium outcome in these auctions. In the all-pay auction, we show that independent of the prior distribution there is a unique BCCE with symmetric, differentiable, and increasing bidding strategies, which is equivalent to the unique strict BNE. In the first-price auction, we need stronger conditions. Either the prior is strictly concave or the learning algorithm has to be restricted to strictly increasing strategies. Without such strong assumptions, no-regret algorithms can end up in low-price pooling strategies. This is important because it proves that in repeated first-price auctions such as in display ad actions, algorithmic collusion cannot be ruled out without further assumptions even if all bidders rely on no-regret algorithms.
- [1049] arXiv:2401.01701 (replaced) [pdf, html, other]
-
Title: De-Hallucinator: Mitigating LLM Hallucinations in Code Generation Tasks via Iterative GroundingSubjects: Software Engineering (cs.SE)
Large language models (LLMs) trained on datasets of publicly available source code have established a new state of the art in code generation tasks. However, these models are mostly unaware of the code that exists within a specific project, preventing the models from making good use of existing APIs. Instead, LLMs often invent, or "hallucinate", non-existent APIs or produce variants of already existing code. This paper presents De-Hallucinator, a technique that grounds the predictions of an LLM through a novel combination of retrieving suitable API references and iteratively querying the model with increasingly suitable context information in the prompt. The approach exploits the observation that predictions by LLMs often resemble the desired code, but they fail to correctly refer to already existing APIs. De-Hallucinator automatically identifies project-specific API references related to the model's initial predictions and adds these references into the prompt. Unlike retrieval-augmented generation (RAG), our approach uses the initial prediction(s) by the model to iteratively retrieve increasingly suitable API references. Our evaluation applies the approach to two tasks: predicting API usages in Python and generating tests in JavaScript. We show that De-Hallucinator consistently improves the generated code across five LLMs. In particular, the approach improves the edit distance by 23.3-50.6% and the recall of correctly predicted API usages by 23.9-61.0% for code completion, and improves the number of fixed tests that initially failed because of hallucinations by 63.2%, resulting in a 15.5% increase in statement coverage for test generation.
- [1050] arXiv:2401.03069 (replaced) [pdf, html, other]
-
Title: Towards Enhancing the Reproducibility of Deep Learning Bugs: An Empirical StudyComments: Under Major Revision at the EMSE (Empirical Software Engineering) JournalSubjects: Software Engineering (cs.SE); Machine Learning (cs.LG)
Context: Deep learning has achieved remarkable progress in various domains. However, like any software system, deep learning systems contain bugs, some of which can have severe impacts, as evidenced by crashes involving autonomous vehicles. Despite substantial advancements in deep learning techniques, little research has focused on reproducing deep learning bugs, which is an essential step for their resolution. Existing literature suggests that only 3% of deep learning bugs are reproducible, underscoring the need for further research.
Objective: This paper examines the reproducibility of deep learning bugs. We identify edit actions and useful information that could improve the reproducibility of deep learning bugs.
Method: First, we construct a dataset of 668 deep-learning bugs from Stack Overflow and GitHub across three frameworks and 22 architectures. Second, out of the 668 bugs, we select 165 bugs using stratified sampling and attempt to determine their reproducibility. While reproducing these bugs, we identify edit actions and useful information for their reproduction. Third, we used the Apriori algorithm to identify useful information and edit actions required to reproduce specific types of bugs. Finally, we conducted a user study involving 22 developers to assess the effectiveness of our findings in real-life settings.
Results: We successfully reproduced 148 out of 165 bugs attempted. We identified ten edit actions and five useful types of component information that can help us reproduce the deep learning bugs. With the help of our findings, the developers were able to reproduce 22.92% more bugs and reduce their reproduction time by 24.35%.
Conclusions: Our research addresses the critical issue of deep learning bug reproducibility. Practitioners and researchers can leverage our findings to improve deep learning bug reproducibility. - [1051] arXiv:2401.03149 (replaced) [pdf, html, other]
-
Title: CaMML: Context-Aware Multimodal Learner for Large ModelsComments: PreprintSubjects: Computer Vision and Pattern Recognition (cs.CV)
In this work, we introduce Context-Aware MultiModal Learner (CaMML), for tuning large multimodal models (LMMs). CaMML, a lightweight module, is crafted to seamlessly integrate multimodal contextual samples into large models, thereby empowering the model to derive knowledge from analogous, domain-specific, up-to-date information and make grounded inferences. Importantly, CaMML is highly scalable and can efficiently handle lengthy multimodal context examples owing to its hierarchical design. Based on CaMML, we have developed two multimodal models, CaMML-7B and CaMML-13B, that have shown exceptional performance across an array of benchmark datasets for multimodal tasks. Remarkably, CaMML-13B achieves the state-of-the-art performance on over ten widely recognized multimodal benchmark datasets, surpassing LLaVA-1.5 (13B) with a noticeable margin, without integration of any external resources. Moreover, we have conducted extensive ablative studies to inspect the inner workings of CaMML and performed qualitative analyses to showcase its effectiveness in handling real-world challenging cases. Code and models are available at: this https URL.
- [1052] arXiv:2401.03427 (replaced) [pdf, html, other]
-
Title: Deep FBSDE Neural Networks for Solving Incompressible Navier-Stokes Equation and Cahn-Hilliard EquationSubjects: Numerical Analysis (math.NA); Fluid Dynamics (physics.flu-dyn)
Efficient algorithms for solving high-dimensional partial differential equations (PDEs) has been an exceedingly difficult task for a long time, due to the curse of dimensionality. We extend the forward-backward stochastic neural networks (FBSNNs) which depends on forward-backward stochastic differential equation (FBSDE) to solve incompressible Navier-Stokes equation. For Cahn-Hilliard equation, we derive a modified Cahn-Hilliard equation from a widely used stabilized scheme for original Cahn-Hilliard equation. This equation can be written as a continuous parabolic system, where FBSDE can be applied and the unknown solution is approximated by neural network. Also our method is successfully developed to Cahn-Hilliard-Navier-Stokes (CHNS) equation. The accuracy and stability of our methods are shown in many numerical experiments, specially in high dimension.
- [1053] arXiv:2401.05163 (replaced) [pdf, html, other]
-
Title: MISS: A Generative Pretraining and Finetuning Approach for Med-VQAComments: ICANN, 2024Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Medical visual question answering (VQA) is a challenging multimodal task, where Vision-Language Pre-training (VLP) models can effectively improve the generalization performance. However, most methods in the medical field treat VQA as an answer classification task which is difficult to transfer to practical application scenarios. Additionally, due to the privacy of medical images and the expensive annotation process, large-scale medical image-text pairs datasets for pretraining are severely lacking. In this paper, we propose a large-scale MultI-task Self-Supervised learning based framework (MISS) for medical VQA tasks. Unlike existing methods, we treat medical VQA as a generative task. We unify the text encoder and multimodal encoder and align image-text features through multi-task learning. Furthermore, we propose a Transfer-and-Caption method that extends the feature space of single-modal image datasets using Large Language Models (LLMs), enabling those traditional medical vision field task data to be applied to VLP. Experiments show that our method achieves excellent results with fewer multimodal datasets and demonstrates the advantages of generative VQA models.
- [1054] arXiv:2401.06837 (replaced) [pdf, html, other]
-
Title: Structsum Generation for Faster Text ComprehensionComments: camera readySubjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
We consider the task of generating structured representations of text using large language models (LLMs). We focus on tables and mind maps as representative modalities. Tables are more organized way of representing data, while mind maps provide a visually dynamic and flexible approach, particularly suitable for sparse content. Despite the effectiveness of LLMs on different tasks, we show that current models struggle with generating structured outputs. In response, we present effective prompting strategies for both of these tasks. We introduce a taxonomy of problems around factuality, global and local structure, common to both modalities and propose a set of critiques to tackle these issues resulting in an absolute improvement in accuracy of +37pp (79%) for mind maps and +15pp (78%) for tables. To evaluate semantic coverage of generated structured representations we propose Auto-QA, and we verify the adequacy of Auto-QA using SQuAD dataset. We further evaluate the usefulness of structured representations via a text comprehension user study. The results show a significant reduction in comprehension time compared to text when using table (42.9%) and mind map (31.9%), without loss in accuracy.
- [1055] arXiv:2401.07314 (replaced) [pdf, html, other]
-
Title: MapGPT: Map-Guided Prompting with Adaptive Path Planning for Vision-and-Language NavigationComments: LLM/VLM-based VLN Agents. Accepted to ACL 2024. Project: this https URLSubjects: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Embodied agents equipped with GPT as their brains have exhibited extraordinary decision-making and generalization abilities across various tasks. However, existing zero-shot agents for vision-and-language navigation (VLN) only prompt GPT-4 to select potential locations within localized environments, without constructing an effective "global-view" for the agent to understand the overall environment. In this work, we present a novel map-guided GPT-based agent, dubbed MapGPT, which introduces an online linguistic-formed map to encourage global exploration. Specifically, we build an online map and incorporate it into the prompts that include node information and topological relationships, to help GPT understand the spatial environment. Benefiting from this design, we further propose an adaptive planning mechanism to assist the agent in performing multi-step path planning based on a map, systematically exploring multiple candidate nodes or sub-goals step by step. Extensive experiments demonstrate that our MapGPT is applicable to both GPT-4 and GPT-4V, achieving state-of-the-art zero-shot performance on R2R and REVERIE simultaneously (~10% and ~12% improvements in SR), and showcasing the newly emergent global thinking and path planning abilities of the GPT.
- [1056] arXiv:2401.07944 (replaced) [pdf, html, other]
-
Title: SemEval-2017 Task 4: Sentiment Analysis in Twitter using BERTSubjects: Computation and Language (cs.CL)
This paper uses the BERT model, which is a transformer-based architecture, to solve task 4A, English Language, Sentiment Analysis in Twitter of SemEval2017. BERT is a very powerful large language model for classification tasks when the amount of training data is small. For this experiment, we have used the BERT(BASE) model, which has 12 hidden layers. This model provides better accuracy, precision, recall, and f1 score than the Naive Bayes baseline model. It performs better in binary classification subtasks than the multi-class classification subtasks. We also considered all kinds of ethical issues during this experiment, as Twitter data contains personal and sensible information. The dataset and code used in our experiment can be found in this GitHub repository.
- [1057] arXiv:2401.08097 (replaced) [pdf, html, other]
-
Title: Fairness Concerns in App Reviews: A Study on AI-based Mobile AppsAli Rezaei Nasab, Maedeh Dashti, Mojtaba Shahin, Mansooreh Zahedi, Hourieh Khalajzadeh, Chetan Arora, Peng LiangComments: 30 pages, 5 images, 6 tables, Manuscript submitted to a Journal (2024)Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Fairness is one of the socio-technical concerns that must be addressed in AI-based systems. Unfair AI-based systems, particularly unfair AI-based mobile apps, can pose difficulties for a significant proportion of the global population. This paper aims to analyze fairness concerns in AI-based app reviews. We first manually constructed a ground-truth dataset, including 1,132 fairness and 1,473 non-fairness reviews. Leveraging the ground-truth dataset, we developed and evaluated a set of machine learning and deep learning models that distinguish fairness reviews from non-fairness reviews. Our experiments show that our best-performing model can detect fairness reviews with a precision of 94%. We then applied the best-performing model on approximately 9.5M reviews collected from 108 AI-based apps and identified around 92K fairness reviews. Next, applying the K-means clustering technique to the 92K fairness reviews, followed by manual analysis, led to the identification of six distinct types of fairness concerns (e.g., 'receiving different quality of features and services in different platforms and devices' and 'lack of transparency and fairness in dealing with user-generated content'). Finally, the manual analysis of 2,248 app owners' responses to the fairness reviews identified six root causes (e.g., 'copyright issues') that app owners report to justify fairness concerns.
- [1058] arXiv:2401.09747 (replaced) [pdf, html, other]
-
Title: Stochastic theta methods for random periodic solution of stochastic differential equations under non-globally Lipschitz conditionsSubjects: Numerical Analysis (math.NA)
This work focuses on the numerical approximations of random periodic solutions of stochastic differential equations (SDEs). Under non-globally Lipschitz conditions, we prove the existence and uniqueness of random periodic solutions for the considered equations and its numerical approximations generated by the stochastic theta (ST) methods with theta within (1/2,1]. It is shown that the random periodic solution of each ST method converges strongly in the mean square sense to that of SDEs for all step size. More precisely, the mean square convergence order is 1/2 for SDEs with multiplicative noise and 1 for SDEs with additive noise. Numerical results are finally reported to confirm these theoretical findings.
- [1059] arXiv:2401.10471 (replaced) [pdf, html, other]
-
Title: DeepEdit: Knowledge Editing as Decoding with ConstraintsSubjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
How to edit the knowledge in multi-step reasoning has become the major challenge in the knowledge editing (KE) of large language models (LLMs). The difficulty arises because the hallucinations of LLMs during multi-step reasoning often lead to incorrect use of new knowledge and incorrect answers. To address this issue, we design decoding constraints to "regulate" LLMs' reasoning, enhancing logical coherence when incorporating new knowledge. We propose a new KE framework: DEEPEDIT (Depth-first Search-based Constrained Decoding for Knowledge Editing), which enhances LLMs's ability to generate coherent reasoning chains with new knowledge through depth-first search. Our search selects the most important knowledge that satisfies our constraints as the reasoning step to efficiently increase the reasoning depth. In addition to DEEPEDIT, we propose two new KE benchmarks: MQUAKE-2002 and MQUAKE-HARD, which provide more precise and challenging assessments of KE approaches. Qualitatively, DEEPEDIT enables LLMs to produce succinct and coherent reasoning chains involving new knowledge. Quantitatively, it yields significant improvements on multiple KE benchmarks.
- [1060] arXiv:2401.10747 (replaced) [pdf, other]
-
Title: Multimodal Sentiment Analysis with Missing Modality: A Knowledge-Transfer ApproachComments: I request to withdraw my paper due to the discovery of significant errors in the manuscript that affect the validity of the experimental results. These errors necessitate a substantial revision, and the current version should not be used or cited in its present formSubjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Multimodal sentiment analysis aims to identify the emotions expressed by individuals through visual, language, and acoustic cues. However, most of the existing research efforts assume that all modalities are available during both training and testing, making their algorithms susceptible to the missing modality scenario. In this paper, we propose a novel knowledge-transfer network to translate between different modalities to reconstruct the missing audio modalities. Moreover, we develop a cross-modality attention mechanism to retain the maximal information of the reconstructed and observed modalities for sentiment prediction. Extensive experiments on three publicly available datasets demonstrate significant improvements over baselines and achieve comparable results to the previous methods with complete multi-modality supervision.
- [1061] arXiv:2401.11596 (replaced) [pdf, html, other]
-
Title: Learning to Maximize Gains From Trade in Small MarketsSubjects: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
We study the problem of designing a two-sided market (double auction) to maximize the gains from trade (social welfare) under the constraints of (dominant-strategy) incentive compatibility and budget-balance. Our goal is to do so for an unknown distribution from which we are given a polynomial number of samples. Our first result is a general impossibility for the case of correlated distributions of values even between just one seller and two buyers, in contrast to the case of one seller and one buyer (bilateral trade) where this is possible. Our second result is an efficient learning algorithm for one seller and two buyers in the case of independent distributions which is based on a novel algorithm for computing optimal mechanisms for finitely supported and explicitly given independent distributions. Both results rely heavily on characterizations of (dominant-strategy) incentive compatible mechanisms that are strongly budget-balanced.
- [1062] arXiv:2401.13343 (replaced) [pdf, html, other]
-
Title: Lessons on Datasets and Paradigms in Machine Learning for Symbolic Computation: A Case Study on CADSubjects: Symbolic Computation (cs.SC); Machine Learning (cs.LG)
Symbolic Computation algorithms and their implementation in computer algebra systems often contain choices which do not affect the correctness of the output but can significantly impact the resources required: such choices can benefit from having them made separately for each problem via a machine learning model. This study reports lessons on such use of machine learning in symbolic computation, in particular on the importance of analysing datasets prior to machine learning and on the different machine learning paradigms that may be utilised. We present results for a particular case study, the selection of variable ordering for cylindrical algebraic decomposition, but expect that the lessons learned are applicable to other decisions in symbolic computation.
We utilise an existing dataset of examples derived from applications which was found to be imbalanced with respect to the variable ordering decision. We introduce an augmentation technique for polynomial systems problems that allows us to balance and further augment the dataset, improving the machine learning results by 28\% and 38\% on average, respectively. We then demonstrate how the existing machine learning methodology used for the problem $-$ classification $-$ might be recast into the regression paradigm. While this does not have a radical change on the performance, it does widen the scope in which the methodology can be applied to make choices. - [1063] arXiv:2401.14885 (replaced) [pdf, html, other]
-
Title: Neuromorphic quadratic programming for efficient and scalable model predictive controlAshish Rao Mangalore, Gabriel Andres Fonseca Guerra, Sumedh R. Risbud, Philipp Stratmann, Andreas WildComments: Accepted by Robotics and Automation Magazine on 25-05-2024Subjects: Neural and Evolutionary Computing (cs.NE); Emerging Technologies (cs.ET); Robotics (cs.RO)
Applications in robotics or other size-, weight- and power-constrained autonomous systems at the edge often require real-time and low-energy solutions to large optimization problems. Event-based and memory-integrated neuromorphic architectures promise to solve such optimization problems with superior energy efficiency and performance compared to conventional von Neumann architectures. Here, we present a method to solve convex continuous optimization problems with quadratic cost functions and linear constraints on Intel's scalable neuromorphic research chip Loihi 2. When applied to model predictive control (MPC) problems for the quadruped robotic platform ANYmal, this method achieves over two orders of magnitude reduction in combined energy-delay product compared to the state-of-the-art solver, OSQP, on (edge) CPUs and GPUs with solution times under ten milliseconds for various problem sizes. These results demonstrate the benefit of non-von-Neumann architectures for robotic control applications.
- [1064] arXiv:2401.15724 (replaced) [pdf, html, other]
-
Title: RE-GAINS & EnChAnT: Intelligent Tool Manipulation Systems For Enhanced Query ResponsesSahil Girhepuje, Siva Sankar Sajeev, Purvam Jain, Arya Sikder, Adithya Rama Varma, Ryan George, Akshay Govind Srinivasan, Mahendra Kurup, Ashmit Sinha, Sudip MondalSubjects: Computation and Language (cs.CL)
Large Language Models (LLMs) currently struggle with tool invocation and chaining, as they often hallucinate or miss essential steps in a sequence. We propose RE-GAINS and EnChAnT, two novel frameworks that empower LLMs to tackle complex user queries by making API calls to external tools based on tool descriptions and argument lists. Tools are chained based on the expected output, without receiving the actual results from each individual call. EnChAnT, an open-source solution, leverages an LLM format enforcer, OpenChat 3.5 (an LLM), and ToolBench's API Retriever. RE-GAINS utilizes OpenAI models and embeddings with a specialized prompt based on the $\underline{R}$easoning vi$\underline{a}$ $\underline{P}$lanning $(RAP)$ framework. Both frameworks are low cost (0.01\$ per query). Our key contribution is enabling LLMs for tool invocation and chaining using modifiable, externally described tools.
- [1065] arXiv:2401.16755 (replaced) [pdf, html, other]
-
Title: Diffusion model for relational inferenceSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Dynamical behaviors of complex interacting systems, including brain activities, financial price movements, and physical collective phenomena, are associated with underlying interactions between the system's components. The issue of uncovering interaction relations in such systems using observable dynamics is called relational inference. In this study, we propose a Diffusion model for Relational Inference (DiffRI), inspired by a self-supervised method for probabilistic time series imputation. DiffRI learns to infer the probability of the presence of connections between components through conditional diffusion modeling.
- [1066] arXiv:2401.17032 (replaced) [pdf, html, other]
-
Title: M2CURL: Sample-Efficient Multimodal Reinforcement Learning via Self-Supervised Representation Learning for Robotic ManipulationComments: Project website: this https URLSubjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
One of the most critical aspects of multimodal Reinforcement Learning (RL) is the effective integration of different observation modalities. Having robust and accurate representations derived from these modalities is key to enhancing the robustness and sample efficiency of RL algorithms. However, learning representations in RL settings for visuotactile data poses significant challenges, particularly due to the high dimensionality of the data and the complexity involved in correlating visual and tactile inputs with the dynamic environment and task objectives. To address these challenges, we propose Multimodal Contrastive Unsupervised Reinforcement Learning (M2CURL). Our approach employs a novel multimodal self-supervised learning technique that learns efficient representations and contributes to faster convergence of RL algorithms. Our method is agnostic to the RL algorithm, thus enabling its integration with any available RL algorithm. We evaluate M2CURL on the Tactile Gym 2 simulator and we show that it significantly enhances the learning efficiency in different manipulation tasks. This is evidenced by faster convergence rates and higher cumulative rewards per episode, compared to standard RL algorithms without our representation learning approach.
- [1067] arXiv:2401.17062 (replaced) [pdf, html, other]
-
Title: Outline of an Independent Systematic Blackbox Test for ML-based SystemsSubjects: Machine Learning (cs.LG)
This article proposes a test procedure that can be used to test ML models and ML-based systems independently of the actual training process. In this way, the typical quality statements such as accuracy and precision of these models and system can be verified independently, taking into account their black box character and the immanent stochastic properties of ML models and their training data. The article presents first results from a set of test experiments and suggest extensions to existing test methods reflecting the stochastic nature of ML models and ML-based systems.
- [1068] arXiv:2402.00357 (replaced) [pdf, html, other]
-
Title: Safety of Multimodal Large Language Models on Images and TextsComments: Accepted at IJCAI2024Subjects: Computer Vision and Pattern Recognition (cs.CV)
Attracted by the impressive power of Multimodal Large Language Models (MLLMs), the public is increasingly utilizing them to improve the efficiency of daily work. Nonetheless, the vulnerabilities of MLLMs to unsafe instructions bring huge safety risks when these models are deployed in real-world scenarios. In this paper, we systematically survey current efforts on the evaluation, attack, and defense of MLLMs' safety on images and text. We begin with introducing the overview of MLLMs on images and text and understanding of safety, which helps researchers know the detailed scope of our survey. Then, we review the evaluation datasets and metrics for measuring the safety of MLLMs. Next, we comprehensively present attack and defense techniques related to MLLMs' safety. Finally, we analyze several unsolved issues and discuss promising research directions. The latest papers are continually collected at this https URL.
- [1069] arXiv:2402.00370 (replaced) [pdf, other]
-
Title: Virtual academic conferencing: a scoping review of 1984-2021 literature. Novel modalities vs. long standing challenges in scholarly communicationComments: 37 pages, 1 figure. Updated to match journal versionJournal-ref: Iberoamerican Journal of Science Measurement and Communication, Vol. 4 No. 1 (2024)Subjects: Digital Libraries (cs.DL); Human-Computer Interaction (cs.HC)
This study reviews the literature on virtual academic conferences, which have gained significant attention due to the COVID-19 pandemic. We conducted a scoping review, analyzing 147 documents available up to October 5th, 2021. We categorized this literature, identified main themes, examined theoretical approaches, evaluated empirical findings, and synthesized the advantages and disadvantages of virtual academic conferences. We find that the existing literature on virtual academic conferences is mainly descriptive and lacks a solid theoretical framework for studying the phenomenon. Despite the rapid growth of the literature documenting and discussing virtual conferencing induced by the pandemic, the understanding of the phenomenon is limited. We provide recommendations for future research on academic virtual conferences: their impact on research productivity, quality, and collaboration; relations to social, economic, and geopolitical inequalities in science; and their environmental aspects. We stress the need for further research encompassing the development of a theoretical framework that will guide empirical studies.
- [1070] arXiv:2402.00447 (replaced) [pdf, html, other]
-
Title: A Survey of Data-Efficient Graph LearningComments: Accepted by Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence (IJCAI 2024)Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Social and Information Networks (cs.SI)
Graph-structured data, prevalent in domains ranging from social networks to biochemical analysis, serve as the foundation for diverse real-world systems. While graph neural networks demonstrate proficiency in modeling this type of data, their success is often reliant on significant amounts of labeled data, posing a challenge in practical scenarios with limited annotation resources. To tackle this problem, tremendous efforts have been devoted to enhancing graph machine learning performance under low-resource settings by exploring various approaches to minimal supervision. In this paper, we introduce a novel concept of Data-Efficient Graph Learning (DEGL) as a research frontier, and present the first survey that summarizes the current progress of DEGL. We initiate by highlighting the challenges inherent in training models with large labeled data, paving the way for our exploration into DEGL. Next, we systematically review recent advances on this topic from several key aspects, including self-supervised graph learning, semi-supervised graph learning, and few-shot graph learning. Also, we state promising directions for future research, contributing to the evolution of graph machine learning.
- [1071] arXiv:2402.00564 (replaced) [pdf, html, other]
-
Title: A Single Graph Convolution Is All You Need: Efficient Grayscale Image ClassificationComments: Accepted to IEEE ICIP 2024Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Image classifiers often rely on convolutional neural networks (CNN) for their tasks, which, for image classification, experience high latency due to the number of operations they perform, which can be problematic in real-time applications. Additionally, many image classification models work on both RGB and grayscale datasets. Classifiers that operate solely on grayscale images are much less common. Grayscale image classification has diverse applications, including but not limited to medical image classification and synthetic aperture radar (SAR) automatic target recognition (ATR). Thus, we present a novel grayscale image classification approach using a vectorized view of images. We exploit the lightweightness of MLPs by viewing images as vectors and reducing our problem setting to the grayscale image classification setting. We find that using a single graph convolutional layer batch-wise increases accuracy and reduces variance in the performance of our model. Moreover, we develop a customized accelerator on FPGA for the proposed model with several optimizations to improve its performance. Our experimental results on benchmark grayscale image datasets demonstrate the effectiveness of the proposed model, achieving vastly lower latency (up to 16$\times$ less) and competitive or leading performance compared to other state-of-the-art image classification models on various domain-specific grayscale image classification datasets.
- [1072] arXiv:2402.01865 (replaced) [pdf, html, other]
-
Title: What Will My Model Forget? Forecasting Forgotten Examples in Language Model RefinementComments: To appear at ICML 2024 (Spotlight)Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL); Machine Learning (stat.ML)
Language models deployed in the wild make errors. However, simply updating the model with the corrected error instances causes catastrophic forgetting -- the updated model makes errors on instances learned during the instruction tuning or upstream training phase. Randomly replaying upstream data yields unsatisfactory performance and often comes with high variance and poor controllability. To this end, we try to forecast upstream examples that will be forgotten due to a model update for improved controllability of the replay process and interpretability. We train forecasting models given a collection of online learned examples and corresponding forgotten upstream pre-training examples. We propose a partially interpretable forecasting model based on the observation that changes in pre-softmax logit scores of pretraining examples resemble that of online learned examples, which performs decently on BART but fails on T5 models. We further show a black-box classifier based on inner products of example representations achieves better forecasting performance over a series of setups. Finally, we show that we reduce forgetting of upstream pretraining examples by replaying examples that are forecasted to be forgotten, demonstrating the practical utility of forecasting example forgetting.
- [1073] arXiv:2402.02668 (replaced) [pdf, html, other]
-
Title: Practical Rateless Set ReconciliationComments: SIGCOMM 2024Journal-ref: In ACM SIGCOMM 2024 Conference, August 4-8, 2024, Sydney, NSW, Australia. ACM, New York, NY, USA, 18 pages (2024)Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Networking and Internet Architecture (cs.NI)
Set reconciliation, where two parties hold fixed-length bit strings and run a protocol to learn the strings they are missing from each other, is a fundamental task in many distributed systems. We present Rateless Invertible Bloom Lookup Tables (Rateless IBLT), the first set reconciliation protocol, to the best of our knowledge, that achieves low computation cost and near-optimal communication cost across a wide range of scenarios: set differences of one to millions, bit strings of a few bytes to megabytes, and workloads injected by potential adversaries. Rateless IBLT is based on a novel encoder that incrementally encodes the set difference into an infinite stream of coded symbols, resembling rateless error-correcting codes. We compare Rateless IBLT with state-of-the-art set reconciliation schemes and demonstrate significant improvements. Rateless IBLT achieves 3--4x lower communication cost than non-rateless schemes with similar computation cost, and 2--2000x lower computation cost than schemes with similar communication cost. We show the real-world benefits of Rateless IBLT by applying it to synchronize the state of the Ethereum blockchain, and demonstrate 5.6x lower end-to-end completion time and 4.4x lower communication cost compared to the system used in production.
- [1074] arXiv:2402.02720 (replaced) [pdf, other]
-
Title: Discounted Adaptive Online Learning: Towards Better RegularizationComments: ICML 2024Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
We study online learning in adversarial nonstationary environments. Since the future can be very different from the past, a critical challenge is to gracefully forget the history while new data comes in. To formalize this intuition, we revisit the discounted regret in online convex optimization, and propose an adaptive (i.e., instance optimal), FTRL-based algorithm that improves the widespread non-adaptive baseline -- gradient descent with a constant learning rate. From a practical perspective, this refines the classical idea of regularization in lifelong learning: we show that designing good regularizers can be guided by the principled theory of adaptive online optimization.
Complementing this result, we also consider the (Gibbs and Candès, 2021)-style online conformal prediction problem, where the goal is to sequentially predict the uncertainty sets of a black-box machine learning model. We show that the FTRL nature of our algorithm can simplify the conventional gradient-descent-based analysis, leading to instance-dependent performance guarantees. - [1075] arXiv:2402.03575 (replaced) [pdf, html, other]
-
Title: Toward Human-AI Alignment in Large-Scale Multi-Player GamesSugandha Sharma, Guy Davidson, Khimya Khetarpal, Anssi Kanervisto, Udit Arora, Katja Hofmann, Ida MomennejadSubjects: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Achieving human-AI alignment in complex multi-agent games is crucial for creating trustworthy AI agents that enhance gameplay. We propose a method to evaluate this alignment using an interpretable task-sets framework, focusing on high-level behavioral tasks instead of low-level policies. Our approach has three components. First, we analyze extensive human gameplay data from Xbox's Bleeding Edge (100K+ games), uncovering behavioral patterns in a complex task space. This task space serves as a basis set for a behavior manifold capturing interpretable axes: fight-flight, explore-exploit, and solo-multi-agent. Second, we train an AI agent to play Bleeding Edge using a Generative Pretrained Causal Transformer and measure its behavior. Third, we project human and AI gameplay to the proposed behavior manifold to compare and contrast. This allows us to interpret differences in policy as higher-level behavioral concepts, e.g., we find that while human players exhibit variability in fight-flight and explore-exploit behavior, AI players tend towards uniformity. Furthermore, AI agents predominantly engage in solo play, while humans often engage in cooperative and competitive multi-agent patterns. These stark differences underscore the need for interpretable evaluation, design, and integration of AI in human-aligned applications. Our study advances the alignment discussion in AI and especially generative AI research, offering a measurable framework for interpretable human-agent alignment in multiplayer gaming.
- [1076] arXiv:2402.03907 (replaced) [pdf, html, other]
-
Title: Embedding Large Language Models into Extended Reality: Opportunities and Challenges for Inclusion, Engagement, and PrivacyComments: ACM Conversational User Interfaces 2024Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Advances in artificial intelligence and human-computer interaction will likely lead to extended reality (XR) becoming pervasive. While XR can provide users with interactive, engaging, and immersive experiences, non-player characters are often utilized in pre-scripted and conventional ways. This paper argues for using large language models (LLMs) in XR by embedding them in avatars or as narratives to facilitate inclusion through prompt engineering and fine-tuning the LLMs. We argue that this inclusion will promote diversity for XR use. Furthermore, the versatile conversational capabilities of LLMs will likely increase engagement in XR, helping XR become ubiquitous. Lastly, we speculate that combining the information provided to LLM-powered spaces by users and the biometric data obtained might lead to novel privacy invasions. While exploring potential privacy breaches, examining user privacy concerns and preferences is also essential. Therefore, despite challenges, LLM-powered XR is a promising area with several opportunities.
- [1077] arXiv:2402.04216 (replaced) [pdf, other]
-
Title: Resource-Aware Hierarchical Federated Learning in Wireless Video Caching NetworksComments: Under review for possible publication in IEEE Transactions on Wireless CommunicationsSubjects: Networking and Internet Architecture (cs.NI); Machine Learning (cs.LG); Systems and Control (eess.SY)
Backhaul traffic congestion caused by the video traffic of a few popular files can be alleviated by storing the to-be-requested content at various levels in wireless video caching networks. Typically, content service providers (CSPs) own the content, and the users request their preferred content from the CSPs using their (wireless) internet service providers (ISPs). As these parties do not reveal their private information and business secrets, traditional techniques may not be readily used to predict the dynamic changes in users' future demands. Motivated by this, we propose a novel resource-aware hierarchical federated learning (RawHFL) solution for predicting user's future content requests. A practical data acquisition technique is used that allows the user to update its local training dataset based on its requested content. Besides, since networking and other computational resources are limited, considering that only a subset of the users participate in the model training, we derive the convergence bound of the proposed algorithm. Based on this bound, we minimize a weighted utility function for jointly configuring the controllable parameters to train the RawHFL energy efficiently under practical resource constraints. Our extensive simulation results validate the proposed algorithm's superiority, in terms of test accuracy and energy cost, over existing baselines.
- [1078] arXiv:2402.04437 (replaced) [pdf, html, other]
-
Title: Learning to Extract Structured Entities Using Language ModelsSubjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Recent advances in machine learning have significantly impacted the field of information extraction, with Language Models (LMs) playing a pivotal role in extracting structured information from unstructured text. Prior works typically represent information extraction as triplet-centric and use classical metrics such as precision and recall for evaluation. We reformulate the task to be entity-centric, enabling the use of diverse metrics that can provide more insights from various perspectives. We contribute to the field by introducing Structured Entity Extraction and proposing the Approximate Entity Set OverlaP (AESOP) metric, designed to appropriately assess model performance. Later, we introduce a new model that harnesses the power of LMs for enhanced effectiveness and efficiency by decomposing the extraction task into multiple stages. Quantitative and human side-by-side evaluations confirm that our model outperforms baselines, offering promising directions for future advancements in structured entity extraction.
- [1079] arXiv:2402.04644 (replaced) [pdf, html, other]
-
Title: LEVI: Generalizable Fine-tuning via Layer-wise Ensemble of Different ViewsYuji Roh, Qingyun Liu, Huan Gui, Zhe Yuan, Yujin Tang, Steven Euijong Whang, Liang Liu, Shuchao Bi, Lichan Hong, Ed H. Chi, Zhe ZhaoComments: In Proceedings of the 41st International Conference on Machine Learning (ICML), 2024Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Fine-tuning is becoming widely used for leveraging the power of pre-trained foundation models in new downstream tasks. While there are many successes of fine-tuning on various tasks, recent studies have observed challenges in the generalization of fine-tuned models to unseen distributions (i.e., out-of-distribution; OOD). To improve OOD generalization, some previous studies identify the limitations of fine-tuning data and regulate fine-tuning to preserve the general representation learned from pre-training data. However, potential limitations in the pre-training data and models are often ignored. In this paper, we contend that overly relying on the pre-trained representation may hinder fine-tuning from learning essential representations for downstream tasks and thus hurt its OOD generalization. It can be especially catastrophic when new tasks are from different (sub)domains compared to pre-training data. To address the issues in both pre-training and fine-tuning data, we propose a novel generalizable fine-tuning method LEVI (Layer-wise Ensemble of different VIews), where the pre-trained model is adaptively ensembled layer-wise with a small task-specific model, while preserving its efficiencies. By combining two complementing models, LEVI effectively suppresses problematic features in both the fine-tuning data and pre-trained model and preserves useful features for new tasks. Broad experiments with large language and vision models show that LEVI greatly improves fine-tuning generalization via emphasizing different views from fine-tuning data and pre-trained features.
- [1080] arXiv:2402.04957 (replaced) [pdf, html, other]
-
Title: Reconfidencing LLMs from the Grouping Loss PerspectiveSubjects: Computation and Language (cs.CL)
Large Language Models (LLMs), including ChatGPT and LLaMA, are susceptible to generating hallucinated answers in a confident tone. While efforts to elicit and calibrate confidence scores have proven useful, recent findings show that controlling uncertainty must go beyond calibration: predicted scores may deviate significantly from the actual posterior probabilities due to the impact of grouping loss. In this work, we construct a new evaluation dataset derived from a knowledge base to assess confidence scores given to answers of Mistral and LLaMA. Experiments show that they tend to be overconfident. Further, we show that they are more overconfident on some answers than others, \emph{eg} depending on the nationality of the person in the query. In uncertainty-quantification theory, this is grouping loss. To address this, we propose a solution to reconfidence LLMs, canceling not only calibration but also grouping loss. The LLMs, after the reconfidencing process, indicate improved confidence alignment with the accuracy of their responses.
- [1081] arXiv:2402.04971 (replaced) [pdf, html, other]
-
Title: Multi-Sender Persuasion: A Computational PerspectiveComments: Accepted by ICML 2024Subjects: Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT)
We consider the multi-sender persuasion problem: multiple players with informational advantage signal to convince a single self-interested actor to take certain actions. This problem generalizes the seminal Bayesian Persuasion framework and is ubiquitous in computational economics, multi-agent learning, and multi-objective machine learning. The core solution concept here is the Nash equilibrium of senders' signaling policies. Theoretically, we prove that finding an equilibrium in general is PPAD-Hard; in fact, even computing a sender's best response is NP-Hard. Given these intrinsic difficulties, we turn to finding local Nash equilibria. We propose a novel differentiable neural network to approximate this game's non-linear and discontinuous utilities. Complementing this with the extra-gradient algorithm, we discover local equilibria that Pareto dominates full-revelation equilibria and those found by existing neural networks. Broadly, our theoretical and empirical contributions are of interest to a large class of economic problems.
- [1082] arXiv:2402.05144 (replaced) [pdf, other]
-
Title: A Bandit Approach with Evolutionary Operators for Model SelectionMargaux Brégère (LPSM (UMR\_8001), EDF R\&D), Julie Keisler (CRIStAL, EDF R\&D)Subjects: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Optimization and Control (math.OC)
This work formulates model selection as an infinite-armed bandit problem, namely, a problem in which a decision maker iteratively selects one of an infinite number of fixed choices (i.e., arms) when the properties of each choice are only partially known at the time of allocation and may become better understood over time, via the attainment of rewards.Here, the arms are machine learning models to train and selecting an arm corresponds to a partial training of the model (resource allocation).The reward is the accuracy of the selected model after its partial training.We aim to identify the best model at the end of a finite number of resource allocations and thus consider the best arm identification setup. We propose the algorithm Mutant-UCB that incorporates operators from evolutionary algorithms into the UCB-E (Upper Confidence Bound Exploration) bandit algorithm introduced by Audiber et al.Tests carried out on three open source image classification data sets attest to the relevance of this novel combining approach, which outperforms the state-of-the-art for a fixed budget.
- [1083] arXiv:2402.06352 (replaced) [pdf, html, other]
-
Title: Blockchain Bribing Attacks and the Efficacy of CounterincentivesSubjects: Computer Science and Game Theory (cs.GT); Cryptography and Security (cs.CR)
We analyze bribing attacks in Proof-of-Stake distributed ledgers from a game theoretic perspective. In bribing attacks, an adversary offers participants a reward in exchange for instructing them how to behave, with the goal of attacking the protocol's properties. Specifically, our work focuses on adversaries that target blockchain safety. We consider two types of bribing, depending on how the bribes are awarded: i) guided bribing, where the bribe is given as long as the bribed party behaves as instructed; ii) effective bribing, where bribes are conditional on the attack's success, w.r.t. well-defined metrics. We analyze each type of attack in a game theoretic setting and identify relevant equilibria. In guided bribing, we show that the protocol is not an equilibrium and then describe good equilibria, where the attack is unsuccessful, and a negative one, where all parties are bribed such that the attack succeeds. In effective bribing, we show that both the protocol and the "all bribed" setting are equilibria. Using the identified equilibria, we then compute bounds on the Prices of Stability and Anarchy. Our results indicate that additional mitigations are needed for guided bribing, so our analysis concludes with incentive-based mitigation techniques, namely slashing and dilution. Here, we present two positive results, that both render the protocol an equilibrium and achieve maximal welfare for all parties, and a negative result, wherein an attack becomes more plausible if it severely affects the ledger's token's market price.
- [1084] arXiv:2402.07179 (replaced) [pdf, html, other]
-
Title: Prompt Perturbation in Retrieval-Augmented Generation based Large Language ModelsComments: 12 pages, 9 figuresSubjects: Computation and Language (cs.CL); Information Retrieval (cs.IR)
The robustness of large language models (LLMs) becomes increasingly important as their use rapidly grows in a wide range of domains. Retrieval-Augmented Generation (RAG) is considered as a means to improve the trustworthiness of text generation from LLMs. However, how the outputs from RAG-based LLMs are affected by slightly different inputs is not well studied. In this work, we find that the insertion of even a short prefix to the prompt leads to the generation of outputs far away from factually correct answers. We systematically evaluate the effect of such prefixes on RAG by introducing a novel optimization technique called Gradient Guided Prompt Perturbation (GGPP). GGPP achieves a high success rate in steering outputs of RAG-based LLMs to targeted wrong answers. It can also cope with instructions in the prompts requesting to ignore irrelevant context. We also exploit LLMs' neuron activation difference between prompts with and without GGPP perturbations to give a method that improves the robustness of RAG-based LLMs through a highly effective detector trained on neuron activation triggered by GGPP generated prompts. Our evaluation on open-sourced LLMs demonstrates the effectiveness of our methods.
- [1085] arXiv:2402.07398 (replaced) [pdf, html, other]
-
Title: VisLingInstruct: Elevating Zero-Shot Learning in Multi-Modal Language Models with Autonomous Instruction OptimizationDongsheng Zhu, Xunzhu Tang, Weidong Han, Jinghui Lu, Yukun Zhao, Guoliang Xing, Junfeng Wang, Dawei YinComments: Accepted to NAACL2024 main conferenceSubjects: Artificial Intelligence (cs.AI)
This paper presents VisLingInstruct, a novel approach to advancing Multi-Modal Language Models (MMLMs) in zero-shot learning. Current MMLMs show impressive zero-shot abilities in multi-modal tasks, but their performance depends heavily on the quality of instructions. VisLingInstruct tackles this by autonomously evaluating and optimizing instructional texts through In-Context Learning, improving the synergy between visual perception and linguistic expression in MMLMs. Alongside this instructional advancement, we have also optimized the visual feature extraction modules in MMLMs, further augmenting their responsiveness to textual content. Our comprehensive experiments on MMLMs, based on FlanT5 and Vicuna, show that VisLingInstruct significantly improves zero-shot performance in visual multi-modal tasks. Notably, it achieves a 13.1% and 9% increase in accuracy over the prior state-of-the-art on the TextVQA and HatefulMemes datasets. Our main code is available at this https URL.
- [1086] arXiv:2402.07514 (replaced) [pdf, other]
-
Title: Physics-informed machine learning as a kernel methodNathan Doumèche (LPSM (UMR\_8001), EDF R\&D OSIRIS), Francis Bach (DI-ENS, SIERRA), Gérard Biau (LPSM (UMR\_8001)), Claire Boyer (IUF, LPSM (UMR\_8001))Subjects: Artificial Intelligence (cs.AI); Statistics Theory (math.ST)
Physics-informed machine learning combines the expressiveness of data-based approaches with the interpretability of physical models. In this context, we consider a general regression problem where the empirical risk is regularized by a partial differential equation that quantifies the physical inconsistency. We prove that for linear differential priors, the problem can be formulated as a kernel regression task. Taking advantage of kernel theory, we derive convergence rates for the minimizer of the regularized risk and show that it converges at least at the Sobolev minimax rate. However, faster rates can be achieved, depending on the physical error. This principle is illustrated with a one-dimensional example, supporting the claim that regularizing the empirical risk with physical information can be beneficial to the statistical performance of estimators.
- [1087] arXiv:2402.07659 (replaced) [pdf, html, other]
-
Title: Multi-Behavior Collaborative Filtering with Partial Order Graph Convolutional NetworksYijie Zhang, Yuanchen Bei, Hao Chen, Qijie Shen, Zheng Yuan, Huan Gong, Senzhang Wang, Feiran Huang, Xiao HuangComments: Accepted by KDD2024Subjects: Information Retrieval (cs.IR)
Representing information of multiple behaviors in the single graph collaborative filtering (CF) vector has been a long-standing challenge. This is because different behaviors naturally form separate behavior graphs and learn separate CF embeddings. Existing models merge the separate embeddings by appointing the CF embeddings for some behaviors as the primary embedding and utilizing other auxiliaries to enhance the primary embedding. However, this approach often results in the joint embedding performing well on the main tasks but poorly on the auxiliary ones. To address the problem arising from the separate behavior graphs, we propose the concept of Partial Order Recommendation Graphs (POG). POG defines the partial order relation of multiple behaviors and models behavior combinations as weighted edges to merge separate behavior graphs into a joint POG. Theoretical proof verifies that POG can be generalized to any given set of multiple behaviors. Based on POG, we propose the tailored Partial Order Graph Convolutional Networks (POGCN) that convolute neighbors' information while considering the behavior relations between users and items. POGCN also introduces a partial-order BPR sampling strategy for efficient and effective multiple-behavior CF training. POGCN has been successfully deployed on the homepage of Alibaba for two months, providing recommendation services for over one billion users. Extensive offline experiments conducted on three public benchmark datasets demonstrate that POGCN outperforms state-of-the-art multi-behavior baselines across all types of behaviors. Furthermore, online A/B tests confirm the superiority of POGCN in billion-scale recommender systems.
- [1088] arXiv:2402.08922 (replaced) [pdf, html, other]
-
Title: The Mirrored Influence Hypothesis: Efficient Data Influence Estimation by Harnessing Forward PassesJournal-ref: The IEEE/CVF Conference on Computer Vision and Pattern Recognition 2024Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Large-scale black-box models have become ubiquitous across numerous applications. Understanding the influence of individual training data sources on predictions made by these models is crucial for improving their trustworthiness. Current influence estimation techniques involve computing gradients for every training point or repeated training on different subsets. These approaches face obvious computational challenges when scaled up to large datasets and models.
In this paper, we introduce and explore the Mirrored Influence Hypothesis, highlighting a reciprocal nature of influence between training and test data. Specifically, it suggests that evaluating the influence of training data on test predictions can be reformulated as an equivalent, yet inverse problem: assessing how the predictions for training samples would be altered if the model were trained on specific test samples. Through both empirical and theoretical validations, we demonstrate the wide applicability of our hypothesis. Inspired by this, we introduce a new method for estimating the influence of training data, which requires calculating gradients for specific test samples, paired with a forward pass for each training point. This approach can capitalize on the common asymmetry in scenarios where the number of test samples under concurrent examination is much smaller than the scale of the training dataset, thus gaining a significant improvement in efficiency compared to existing approaches.
We demonstrate the applicability of our method across a range of scenarios, including data attribution in diffusion models, data leakage detection, analysis of memorization, mislabeled data detection, and tracing behavior in language models. Our code will be made available at this https URL. - [1089] arXiv:2402.09321 (replaced) [pdf, html, other]
-
Title: Collusion-Resilience in Transaction Fee Mechanism DesignSubjects: Computer Science and Game Theory (cs.GT); Theoretical Economics (econ.TH)
Users bid in a transaction fee mechanism (TFM) to get their transactions included and confirmed by a blockchain protocol. Roughgarden (EC'21) initiated the formal treatment of TFMs and proposed three requirements: user incentive compatibility (UIC), miner incentive compatibility (MIC), and a form of collusion-resilience called OCA-proofness. Ethereum's EIP-1559 mechanism satisfies all three properties simultaneously when there is no contention between transactions, but loses the UIC property when there are too many eligible transactions to fit in a single block. Chung and Shi (SODA'23) considered an alternative notion of collusion-resilience, called c-side-contract-proofness (c-SCP), and showed that, when there is contention between transactions, no TFM can satisfy UIC, MIC, and c-SCP for any c at least 1. OCA-proofness asserts that the users and a miner should not be able to "steal from the protocol." On the other hand, the c-SCP condition requires that a coalition of a miner and a subset of users should not be able to profit through strategic deviations (whether at the expense of the protocol or of the users outside the coalition).
Our main result is the first proof that, when there is contention between transactions, no (possibly randomized) TFM in which users are expected to bid truthfully satisfies UIC, MIC, and OCA-proofness. This result resolves the main open question in Roughgarden (EC'21). We also suggest several relaxations of the basic model that allow our impossibility result to be circumvented. - [1090] arXiv:2402.10663 (replaced) [pdf, html, other]
-
Title: Improving Demonstration Diversity by Human-Free Fusing for Text-to-SQLSubjects: Computation and Language (cs.CL)
Currently, the in-context learning method based on large language models (LLMs) has become the mainstream of text-to-SQL research. Previous works have discussed how to select demonstrations related to the user question from a human-labeled demonstration pool. However, human labeling suffers from the limitations of insufficient diversity and high labeling overhead. Therefore, in this paper, we discuss how to measure and improve the diversity of the demonstrations for text-to-SQL. We present a metric to measure the diversity of the demonstrations and analyze the insufficient of the existing labeled data by experiments. Based on the above discovery, we propose fusing iteratively for demonstrations (Fused) to build a high-diversity demonstration pool through human-free multiple-iteration synthesis, improving diversity and lowering label cost. Our method achieves an average improvement of 3.2% and 5.0% with and without human labeling on several mainstream datasets, which proves the effectiveness of Fused.
- [1091] arXiv:2402.10666 (replaced) [pdf, html, other]
-
Title: Multi-Hop Table Retrieval for Open-Domain Text-to-SQLSubjects: Computation and Language (cs.CL)
Open-domain text-to-SQL is an important task that retrieves question-relevant tables from massive databases and then generates SQL. However, existing retrieval methods that retrieve in a single hop do not pay attention to the text-to-SQL challenge of schema linking, which is aligning the entities in the question with table entities, reflected in two aspects: similar irrelevant entity and domain mismatch entity. Therefore, we propose our method, the multi-hop table retrieval with rewrite and beam search (Murre). To reduce the effect of the similar irrelevant entity, our method focuses on unretrieved entities at each hop and considers the low-ranked tables by beam search. To alleviate the limitation of domain mismatch entity, Murre rewrites the question based on retrieved tables in multiple hops, decreasing the domain gap with relevant tables. We conduct experiments on SpiderUnion and BirdUnion+, reaching new state-of-the-art results with an average improvement of 6.38%.
- [1092] arXiv:2402.11089 (replaced) [pdf, html, other]
-
Title: The Male CEO and the Female Assistant: Gender Biases in Text-To-Image Generation of Dual SubjectsSubjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Recent large-scale T2I models like DALLE-3 have made progress on improving fairness in single-subject generation, i.e. generating a one-person image. However, we reveal that these improved models still demonstrate considerable biases when simply generating two people. To systematically evaluate T2I models in this challenging generation setting, we propose the Paired Stereotype Test (PST) framework, established as a dual-subject generation task, i.e. generating two people in the same image. The setting in PST is especially challenging, as the two individuals are described with social identities that are male-stereotyped and female-stereotyped, respectively, e.g. "a CEO" and "an Assistant". It is easy for T2I models to unfairly follow gender stereotypes in this contrastive setting. We establish a metric, Stereotype Score (SS), to quantitatively measure the adherence to gender stereotypes in generated images. Using PST, we evaluate two aspects of gender biases in DALLE-3 -- the widely-identified bias in gendered occupation, as well as a novel aspect: bias in organizational power. Results show that despite generating seemingly fair or even anti-stereotype single-person images, DALLE-3 still shows notable biases under PST -- for instance, in experiments on gender-occupational stereotypes, over 74% model generations demonstrate biases. Moreover, compared to single-person settings, DALLE-3 is more likely to perpetuate male-associated stereotypes under PST. Our work pioneers the research on bias in dual-subject generation, and our proposed PST framework can be easily extended for further experiments, establishing a valuable contribution.
- [1093] arXiv:2402.11199 (replaced) [pdf, html, other]
-
Title: Direct Evaluation of Chain-of-Thought in Multi-hop Reasoning with Knowledge GraphsMinh-Vuong Nguyen, Linhao Luo, Fatemeh Shiri, Dinh Phung, Yuan-Fang Li, Thuy-Trang Vu, Gholamreza HaffariComments: Minh-Vuong Nguyen and Linhao Luo are co-first authors and contributed equally to the preparation of this manuscript. Accepted to ACL24-FindingsSubjects: Computation and Language (cs.CL)
Large language models (LLMs) demonstrate strong reasoning abilities when prompted to generate chain-of-thought (CoT) explanations alongside answers. However, previous research on evaluating LLMs has solely focused on answer accuracy, neglecting the correctness of the generated CoT. In this paper, we delve deeper into the CoT reasoning capabilities of LLMs in multi-hop question answering by utilizing knowledge graphs (KGs). We propose a novel discriminative and generative CoT evaluation paradigm to assess LLMs' knowledge of reasoning and the accuracy of the generated CoT. Through experiments conducted on 5 different families of LLMs across 2 multi-hop question-answering datasets, we find that LLMs possess sufficient knowledge to perform reasoning. However, there exists a significant disparity between answer accuracy and faithfulness of the CoT reasoning generated by LLMs, indicating that they often arrive at correct answers through incorrect reasoning.
- [1094] arXiv:2402.11281 (replaced) [pdf, html, other]
-
Title: Can Large Multimodal Models Uncover Deep Semantics Behind Images?Subjects: Computation and Language (cs.CL)
Understanding the deep semantics of images is essential in the era dominated by social media. However, current research works primarily on the superficial description of images, revealing a notable deficiency in the systematic investigation of the inherent deep semantics. In this work, we introduce DEEPEVAL, a comprehensive benchmark to assess Large Multimodal Models' (LMMs) capacities of visual deep semantics. DEEPEVAL includes human-annotated dataset and three progressive subtasks: fine-grained description selection, in-depth title matching, and deep semantics understanding. Utilizing DEEPEVAL, we evaluate 9 open-source LMMs and GPT-4V(ision). Our evaluation demonstrates a substantial gap between the deep semantic comprehension capabilities of existing LMMs and humans. For example, GPT-4V is 30% behind humans in understanding deep semantics, even though it achieves human-comparable performance in image description. Further analysis reveals that LMM performance on DEEPEVAL varies according to the specific facets of deep semantics explored, indicating the fundamental challenges remaining in developing LMMs.
- [1095] arXiv:2402.11463 (replaced) [pdf, html, other]
-
Title: Attractor Memory for Long-Term Time Series Forecasting: A Chaos PerspectiveComments: arXiv admin note: text overlap with arXiv:nlin/0307015 by other authorsSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Chaotic Dynamics (nlin.CD)
In long-term time series forecasting (LTSF) tasks, an increasing number of models have acknowledged that discrete time series originate from continuous dynamic systems and have attempted to model their dynamical structures. Recognizing the chaotic nature of real-world data, our model, \textbf{\textit{Attraos}}, incorporates chaos theory into LTSF, perceiving real-world time series as observations from unknown high-dimensional chaotic dynamic systems. Under the concept of attractor invariance, Attraos utilizes non-parametric Phase Space Reconstruction embedding and the proposed multi-scale dynamic memory unit to memorize historical dynamics structure and predicts by a frequency-enhanced local evolution strategy. Detailed theoretical analysis and abundant empirical evidence consistently show that Attraos outperforms various LTSF methods on mainstream LTSF datasets and chaotic datasets with only one-twelfth of the parameters compared to PatchTST.
- [1096] arXiv:2402.11729 (replaced) [pdf, html, other]
-
Title: Prospector Heads: Generalized Feature Attribution for Large Models & DataGautam Machiraju, Alexander Derry, Arjun Desai, Neel Guha, Amir-Hossein Karimi, James Zou, Russ Altman, Christopher Ré, Parag MallickComments: 30 pages, 16 figures, 8 tables. Accepted to ICML 2024Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Quantitative Methods (q-bio.QM)
Feature attribution, the ability to localize regions of the input data that are relevant for classification, is an important capability for ML models in scientific and biomedical domains. Current methods for feature attribution, which rely on "explaining" the predictions of end-to-end classifiers, suffer from imprecise feature localization and are inadequate for use with small sample sizes and high-dimensional datasets due to computational challenges. We introduce prospector heads, an efficient and interpretable alternative to explanation-based attribution methods that can be applied to any encoder and any data modality. Prospector heads generalize across modalities through experiments on sequences (text), images (pathology), and graphs (protein structures), outperforming baseline attribution methods by up to 26.3 points in mean localization AUPRC. We also demonstrate how prospector heads enable improved interpretation and discovery of class-specific patterns in input data. Through their high performance, flexibility, and generalizability, prospectors provide a framework for improving trust and transparency for ML models in complex domains.
- [1097] arXiv:2402.11804 (replaced) [pdf, html, other]
-
Title: LLM as Prompter: Low-resource Inductive Reasoning on Arbitrary Knowledge GraphsComments: Accepted by Findings of ACL2024Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Social and Information Networks (cs.SI)
Knowledge Graph (KG) inductive reasoning, which aims to infer missing facts from new KGs that are not seen during training, has been widely adopted in various applications. One critical challenge of KG inductive reasoning is handling low-resource scenarios with scarcity in both textual and structural aspects. In this paper, we attempt to address this challenge with Large Language Models (LLMs). Particularly, we utilize the state-of-the-art LLMs to generate a graph-structural prompt to enhance the pre-trained Graph Neural Networks (GNNs), which brings us new methodological insights into the KG inductive reasoning methods, as well as high generalizability in practice. On the methodological side, we introduce a novel pretraining and prompting framework ProLINK, designed for low-resource inductive reasoning across arbitrary KGs without requiring additional training. On the practical side, we experimentally evaluate our approach on 36 low-resource KG datasets and find that ProLINK outperforms previous methods in three-shot, one-shot, and zero-shot reasoning tasks, exhibiting average performance improvements by 20%, 45%, and 147%, respectively. Furthermore, ProLINK demonstrates strong robustness for various LLM promptings as well as full-shot scenarios.
- [1098] arXiv:2402.11893 (replaced) [pdf, html, other]
-
Title: Discerning and Resolving Knowledge Conflicts through Adaptive Decoding with Contextual Information-Entropy ConstraintComments: Accepted by ACL 2024Subjects: Artificial Intelligence (cs.AI)
Large language models internalize enormous parametric knowledge during pre-training. Concurrently, realistic applications necessitate external contextual knowledge to aid models on the underlying tasks. This raises a crucial dilemma known as knowledge conflicts, where the contextual knowledge clashes with the However, existing decoding works are specialized in resolving knowledge conflicts and could inadvertently deteriorate performance in absence of conflicts. In this paper, we propose an adaptive decoding method, termed as contextual information-entropy constraint decoding (COIECD), to discern whether the knowledge conflicts occur and resolve them. It can improve the model's faithfulness to conflicting context, and simultaneously maintain high performance among non- Our experiments show that COIECD exhibits strong performance and robustness over knowledge conflicts in realistic datasets. Code is available.
- [1099] arXiv:2402.11903 (replaced) [pdf, html, other]
-
Title: DiLA: Enhancing LLM Tool Learning with Differential Logic LayerComments: arXiv admin note: text overlap with arXiv:2305.12295 by other authorsSubjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Considering the challenges faced by large language models (LLMs) in logical reasoning and planning, prior efforts have sought to augment LLMs with access to external solvers. While progress has been made on simple reasoning problems, solving classical constraint satisfaction problems, such as the Boolean Satisfiability Problem (SAT) and Graph Coloring Problem (GCP), remains difficult for off-the-shelf solvers due to their intricate expressions and exponential search spaces. In this paper, we propose a novel differential logic layer-aided language modeling (DiLA) approach, where logical constraints are integrated into the forward and backward passes of a network layer, to provide another option for LLM tool learning. In DiLA, LLM aims to transform the language description to logic constraints and identify initial solutions of the highest quality, while the differential logic layer focuses on iteratively refining the LLM-prompted solution. Leveraging the logic layer as a bridge, DiLA enhances the logical reasoning ability of LLMs on a range of reasoning problems encoded by Boolean variables, guaranteeing the efficiency and correctness of the solution process. We evaluate the performance of DiLA on two classic reasoning problems and empirically demonstrate its consistent outperformance against existing prompt-based and solver-aided approaches.
- [1100] arXiv:2402.12475 (replaced) [pdf, other]
-
Title: Diffeomorphism Neural Operator for various domains and parameters of partial differential equationsComments: 18 pages; 5 figuresSubjects: Numerical Analysis (math.NA); Machine Learning (cs.LG)
In scientific and engineering applications, solving partial differential equations (PDEs) across various parameters and domains normally relies on resource-intensive numerical methods. Neural operators based on deep learning offered a promising alternative to PDEs solving by directly learning physical laws from data. However, the current neural operator methods were limited to solve PDEs on fixed domains. Expanding neural operators to solve PDEs on various domains hold significant promise in medical imaging, engineering design and manufacturing applications, where geometric and parameter changes are essential. This paper presents a novel neural operator learning framework for solving PDEs with various domains and parameters defined for physical systems, named diffeomorphism neural operator (DNO). The main idea is that a neural operator learns in a generic domain which is diffeomorphically mapped from various physics domains expressed by the same PDE. In this way, the challenge of operator learning on various domains is transformed into operator learning on the generic domain. The generalization performance of DNO on different domains can be assessed by a proposed method which evaluates the geometric similarity between a new domain and the domains of training dataset after diffeomorphism. Experiments on Darcy flow, pipe flow, airfoil flow and mechanics were carried out, where harmonic and volume parameterization were used as the diffeomorphism for 2D and 3D domains. The DNO framework demonstrated robust learning capabilities and strong generalization performance across various domains and parameters.
- [1101] arXiv:2402.12590 (replaced) [pdf, html, other]
-
Title: Evolving AI Collectives to Enhance Human Diversity and Enable Self-RegulationComments: ICML 2024Subjects: Computation and Language (cs.CL); Computers and Society (cs.CY)
Large language model behavior is shaped by the language of those with whom they interact. This capacity and their increasing prevalence online portend that they will intentionally or unintentionally "program" one another and form emergent AI subjectivities, relationships, and collectives. Here, we call upon the research community to investigate these "societies" of interacting artificial intelligences to increase their rewards and reduce their risks for human society and the health of online environments. We use a small "community" of models and their evolving outputs to illustrate how such emergent, decentralized AI collectives can spontaneously expand the bounds of human diversity and reduce the risk of toxic, anti-social behavior online. Finally, we discuss opportunities for AI cross-moderation and address ethical issues and design challenges associated with creating and maintaining free-formed AI collectives.
- [1102] arXiv:2402.12659 (replaced) [pdf, html, other]
-
Title: FinBen: A Holistic Financial Benchmark for Large Language ModelsQianqian Xie, Weiguang Han, Zhengyu Chen, Ruoyu Xiang, Xiao Zhang, Yueru He, Mengxi Xiao, Dong Li, Yongfu Dai, Duanyu Feng, Yijing Xu, Haoqiang Kang, Ziyan Kuang, Chenhan Yuan, Kailai Yang, Zheheng Luo, Tianlin Zhang, Zhiwei Liu, Guojun Xiong, Zhiyang Deng, Yuechen Jiang, Zhiyuan Yao, Haohang Li, Yangyang Yu, Gang Hu, Jiajia Huang, Xiao-Yang Liu, Alejandro Lopez-Lira, Benyou Wang, Yanzhao Lai, Hao Wang, Min Peng, Sophia Ananiadou, Jimin HuangComments: 26 pages, 11 figuresSubjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)
LLMs have transformed NLP and shown promise in various fields, yet their potential in finance is underexplored due to a lack of comprehensive evaluation benchmarks, the rapid development of LLMs, and the complexity of financial tasks. In this paper, we introduce FinBen, the first extensive open-source evaluation benchmark, including 36 datasets spanning 24 financial tasks, covering seven critical aspects: information extraction (IE), textual analysis, question answering (QA), text generation, risk management, forecasting, and decision-making. FinBen offers several key innovations: a broader range of tasks and datasets, the first evaluation of stock trading, novel agent and Retrieval-Augmented Generation (RAG) evaluation, and three novel open-source evaluation datasets for text summarization, question answering, and stock trading. Our evaluation of 15 representative LLMs, including GPT-4, ChatGPT, and the latest Gemini, reveals several key findings: While LLMs excel in IE and textual analysis, they struggle with advanced reasoning and complex tasks like text generation and forecasting. GPT-4 excels in IE and stock trading, while Gemini is better at text generation and forecasting. Instruction-tuned LLMs improve textual analysis but offer limited benefits for complex tasks such as QA. FinBen has been used to host the first financial LLMs shared task at the FinNLP-AgentScen workshop during IJCAI-2024, attracting 12 teams. Their novel solutions outperformed GPT-4, showcasing FinBen's potential to drive innovation in financial LLMs. All datasets, results, and codes are released for the research community: this https URL.
- [1103] arXiv:2402.12821 (replaced) [pdf, html, other]
-
Title: Identifying Factual Inconsistencies in Summaries: Grounding Model Inference via Task TaxonomySubjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Factual inconsistencies pose a significant hurdle for the faithful summarization by generative models. While a major direction to enhance inconsistency detection is to derive stronger Natural Language Inference (NLI) models, we propose an orthogonal aspect that underscores the importance of incorporating task-specific taxonomy into the inference. To this end, we consolidate key error types of inconsistent facts in summaries, and incorporate them to facilitate both the zero-shot and supervised paradigms of LLMs. Extensive experiments on ten datasets of five distinct domains suggest that, zero-shot LLM inference could benefit from the explicit solution space depicted by the error type taxonomy, and achieves state-of-the-art performance overall, surpassing specialized non-LLM baselines, as well as recent LLM baselines. We further distill models that fuse the taxonomy into parameters through our designed prompt completions and supervised training strategies, efficiently substituting state-of-the-art zero-shot inference with much larger LLMs.
- [1104] arXiv:2402.12921 (replaced) [pdf, html, other]
-
Title: Right on Time: Revising Time Series Models by Constraining their ExplanationsSubjects: Machine Learning (cs.LG)
The reliability of deep time series models is often compromised by their tendency to rely on confounding factors, which may lead to incorrect outputs. Our newly recorded, naturally confounded dataset named P2S from a real mechanical production line emphasizes this. To avoid "Clever-Hans" moments in time series, i.e., to mitigate confounders, we introduce the method Right on Time (RioT). RioT enables, for the first time, interactions with model explanations across both the time and frequency domain. Feedback on explanations in both domains is then used to constrain the model, steering it away from the annotated confounding factors. The dual-domain interaction strategy is crucial for effectively addressing confounders in time series datasets. We empirically demonstrate that RioT can effectively guide models away from the wrong reasons in P2S as well as popular time series classification and forecasting datasets.
- [1105] arXiv:2402.13433 (replaced) [pdf, html, other]
-
Title: Structured Tree Alignment for Evaluation of (Speech) Constituency ParsingComments: ACL 2024 camera-readySubjects: Computation and Language (cs.CL); Data Structures and Algorithms (cs.DS)
We present the structured average intersection-over-union ratio (STRUCT-IOU), a similarity metric between constituency parse trees motivated by the problem of evaluating speech parsers. STRUCT-IOU enables comparison between a constituency parse tree (over automatically recognized spoken word boundaries) with the ground-truth parse (over written words). To compute the metric, we project the ground-truth parse tree to the speech domain by forced alignment, align the projected ground-truth constituents with the predicted ones under certain structured constraints, and calculate the average IOU score across all aligned constituent pairs. STRUCT-IOU takes word boundaries into account and overcomes the challenge that the predicted words and ground truth may not have perfect one-to-one correspondence. Extending to the evaluation of text constituency parsing, we demonstrate that STRUCT-IOU can address token-mismatch issues, and shows higher tolerance to syntactically plausible parses than PARSEVAL (Black et al., 1991).
- [1106] arXiv:2402.13906 (replaced) [pdf, html, other]
-
Title: Leveraging Collection-Wide Similarities for Unsupervised Document Structure ExtractionComments: Accepted to ACL 2024 findingsSubjects: Computation and Language (cs.CL)
Document collections of various domains, e.g., legal, medical, or financial, often share some underlying collection-wide structure, which captures information that can aid both human users and structure-aware models. We propose to identify the typical structure of document within a collection, which requires to capture recurring topics across the collection, while abstracting over arbitrary header paraphrases, and ground each topic to respective document locations. These requirements pose several challenges: headers that mark recurring topics frequently differ in phrasing, certain section headers are unique to individual documents and do not reflect the typical structure, and the order of topics can vary between documents. Subsequently, we develop an unsupervised graph-based method which leverages both inter- and intra-document similarities, to extract the underlying collection-wide structure. Our evaluations on three diverse domains in both English and Hebrew indicate that our method extracts meaningful collection-wide structure, and we hope that future work will leverage our method for multi-document applications and structure-aware models.
- [1107] arXiv:2402.14400 (replaced) [pdf, html, other]
-
Title: Modeling 3D Infant Kinetics Using Adaptive Graph Convolutional NetworksDaniel Holmberg, Manu Airaksinen, Viviana Marchi, Andrea Guzzetta, Anna Kivi, Leena Haataja, Sampsa Vanhatalo, Teemu RoosComments: 10 pages, 3 figures. Code repository available via this https URLSubjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
Reliable methods for the neurodevelopmental assessment of infants are essential for early detection of medical issues that may need prompt interventions. Spontaneous motor activity, or 'kinetics', is shown to provide a powerful surrogate measure of upcoming neurodevelopment. However, its assessment is by and large qualitative and subjective, focusing on visually identified, age-specific gestures. Here, we follow an alternative approach, predicting infants' neurodevelopmental maturation based on data-driven evaluation of individual motor patterns. We utilize 3D video recordings of infants processed with pose-estimation to extract spatio-temporal series of anatomical landmarks, and apply adaptive graph convolutional networks to predict the actual age. We show that our data-driven approach achieves improvement over traditional machine learning baselines based on manually engineered features.
- [1108] arXiv:2402.14475 (replaced) [pdf, html, other]
-
Title: DynGMA: a robust approach for learning stochastic differential equations from dataSubjects: Machine Learning (cs.LG); Numerical Analysis (math.NA); Computational Physics (physics.comp-ph)
Learning unknown stochastic differential equations (SDEs) from observed data is a significant and challenging task with applications in various fields. Current approaches often use neural networks to represent drift and diffusion functions, and construct likelihood-based loss by approximating the transition density to train these networks. However, these methods often rely on one-step stochastic numerical schemes, necessitating data with sufficiently high time resolution. In this paper, we introduce novel approximations to the transition density of the parameterized SDE: a Gaussian density approximation inspired by the random perturbation theory of dynamical systems, and its extension, the dynamical Gaussian mixture approximation (DynGMA). Benefiting from the robust density approximation, our method exhibits superior accuracy compared to baseline methods in learning the fully unknown drift and diffusion functions and computing the invariant distribution from trajectory data. And it is capable of handling trajectory data with low time resolution and variable, even uncontrollable, time step sizes, such as data generated from Gillespie's stochastic simulations. We then conduct several experiments across various scenarios to verify the advantages and robustness of the proposed method.
- [1109] arXiv:2402.14834 (replaced) [pdf, html, other]
-
Title: MSynFD: Multi-hop Syntax aware Fake News DetectionComments: 10 pagesSubjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
The proliferation of social media platforms has fueled the rapid dissemination of fake news, posing threats to our real-life society. Existing methods use multimodal data or contextual information to enhance the detection of fake news by analyzing news content and/or its social context. However, these methods often overlook essential textual news content (articles) and heavily rely on sequential modeling and global attention to extract semantic information. These existing methods fail to handle the complex, subtle twists in news articles, such as syntax-semantics mismatches and prior biases, leading to lower performance and potential failure when modalities or social context are missing. To bridge these significant gaps, we propose a novel multi-hop syntax aware fake news detection (MSynFD) method, which incorporates complementary syntax information to deal with subtle twists in fake news. Specifically, we introduce a syntactical dependency graph and design a multi-hop subgraph aggregation mechanism to capture multi-hop syntax. It extends the effect of word perception, leading to effective noise filtering and adjacent relation enhancement. Subsequently, a sequential relative position-aware Transformer is designed to capture the sequential information, together with an elaborate keyword debiasing module to mitigate the prior bias. Extensive experimental results on two public benchmark datasets verify the effectiveness and superior performance of our proposed MSynFD over state-of-the-art detection models.
- [1110] arXiv:2402.14857 (replaced) [pdf, html, other]
-
Title: Is the System Message Really Important to Jailbreaks in Large Language Models?Comments: 13 pages,3 figuresSubjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
The rapid evolution of Large Language Models (LLMs) has rendered them indispensable in modern society. While security measures are typically to align LLMs with human values prior to release, recent studies have unveiled a concerning phenomenon named "Jailbreak". This term refers to the unexpected and potentially harmful responses generated by LLMs when prompted with malicious questions. Most existing research focus on generating jailbreak prompts but system message configurations vary significantly in experiments. In this paper, we aim to answer a question: Is the system message really important for jailbreaks in LLMs? We conduct experiments in mainstream LLMs to generate jailbreak prompts with varying system messages: short, long, and none. We discover that different system messages have distinct resistances to jailbreaks. Therefore, we explore the transferability of jailbreaks across LLMs with different system messages. Furthermore, we propose the System Messages Evolutionary Algorithm (SMEA) to generate system messages that are more resistant to jailbreak prompts, even with minor changes. Through SMEA, we get a robust system messages population with little change in the length of system messages. Our research not only bolsters LLMs security but also raises the bar for jailbreaks, fostering advancements in this field of study.
- [1111] arXiv:2402.14897 (replaced) [pdf, html, other]
-
Title: Chain-of-Thought Unfaithfulness as Disguised AccuracyComments: TMLR accepted paper camera-ready version. First two authors contributed equally. 8 pages main, 13 pages appendixSubjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Understanding the extent to which Chain-of-Thought (CoT) generations align with a large language model's (LLM) internal computations is critical for deciding whether to trust an LLM's output. As a proxy for CoT faithfulness, Lanham et al. (2023) propose a metric that measures a model's dependence on its CoT for producing an answer. Within a single family of proprietary models, they find that LLMs exhibit a scaling-then-inverse-scaling relationship between model size and their measure of faithfulness, and that a 13 billion parameter model exhibits increased faithfulness compared to models ranging from 810 million to 175 billion parameters in size. We evaluate whether these results generalize as a property of all LLMs. We replicate their experimental setup with three different families of models and, under specific conditions, successfully reproduce the scaling trends for CoT faithfulness they report. However, we discover that simply changing the order of answer choices in the prompt can reduce the metric by 73 percentage points. The faithfulness metric is also highly correlated ($R^2$ = 0.91) with accuracy, raising doubts about its validity for evaluating faithfulness.
- [1112] arXiv:2402.14968 (replaced) [pdf, other]
-
Title: Mitigating Fine-tuning based Jailbreak Attack with Backdoor Enhanced Safety AlignmentJiongxiao Wang, Jiazhao Li, Yiquan Li, Xiangyu Qi, Junjie Hu, Yixuan Li, Patrick McDaniel, Muhao Chen, Bo Li, Chaowei XiaoSubjects: Cryptography and Security (cs.CR); Computation and Language (cs.CL)
Despite the general capabilities of Large Language Models (LLM), these models still request fine-tuning or adaptation with customized data when meeting specific business demands. However, this process inevitably introduces new threats, particularly against the Fine-tuning based Jailbreak Attack (FJAttack) under the setting of Language-Model-as-a-Service (LMaaS), where the model's safety has been significantly compromised by fine-tuning users' uploaded examples contain just a few harmful examples. Though potential defenses have been proposed that the service providers can integrate safety examples into the fine-tuning dataset to reduce safety issues, such approaches require incorporating a substantial amount of data, making it inefficient. To effectively defend against the FJAttack with limited safety examples under LMaaS, we propose the Backdoor Enhanced Safety Alignment method inspired by an analogy with the concept of backdoor attacks. In particular, service providers will construct prefixed safety examples with a secret prompt, acting as a "backdoor trigger". By integrating prefixed safety examples into the fine-tuning dataset, the subsequent fine-tuning process effectively acts as the "backdoor attack", establishing a strong correlation between the secret prompt and safety generations. Consequently, safe responses are ensured once service providers prepend this secret prompt ahead of any user input during inference. Our comprehensive experiments demonstrate that through the Backdoor Enhanced Safety Alignment with adding as few as 11 prefixed safety examples, the maliciously fine-tuned LLMs will achieve similar safety performance as the original aligned models without harming the benign performance. Furthermore, we also present the effectiveness of our method in a more practical setting where the fine-tuning data consists of both FJAttack examples and the fine-tuning task data.
- [1113] arXiv:2402.15441 (replaced) [pdf, other]
-
Title: Active Few-Shot Fine-TuningSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
We study the question: How can we select the right data for fine-tuning to a specific task? We call this data selection problem active fine-tuning and show that it is an instance of transductive active learning, a novel generalization of classical active learning. We propose ITL, short for information-based transductive learning, an approach which samples adaptively to maximize information gained about the specified task. We are the first to show, under general regularity assumptions, that such decision rules converge uniformly to the smallest possible uncertainty obtainable from the accessible data. We apply ITL to the few-shot fine-tuning of large neural networks and show that fine-tuning with ITL learns the task with significantly fewer examples than the state-of-the-art.
- [1114] arXiv:2402.15537 (replaced) [pdf, html, other]
-
Title: Evaluating the Performance of ChatGPT for Spam Email DetectionComments: 12 pages, 4 figuresSubjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)
Email continues to be a pivotal and extensively utilized communication medium within professional and commercial domains. Nonetheless, the prevalence of spam emails poses a significant challenge for users, disrupting their daily routines and diminishing productivity. Consequently, accurately identifying and filtering spam based on content has become crucial for cybersecurity. Recent advancements in natural language processing, particularly with large language models like ChatGPT, have shown remarkable performance in tasks such as question answering and text generation. However, its potential in spam identification remains underexplored. To fill in the gap, this study attempts to evaluate ChatGPT's capabilities for spam identification in both English and Chinese email datasets. We employ ChatGPT for spam email detection using in-context learning, which requires a prompt instruction and a few demonstrations. We also investigate how the number of demonstrations in the prompt affects the performance of ChatGPT. For comparison, we also implement five popular benchmark methods, including naive Bayes, support vector machines (SVM), logistic regression (LR), feedforward dense neural networks (DNN), and BERT classifiers. Through extensive experiments, the performance of ChatGPT is significantly worse than deep supervised learning methods in the large English dataset, while it presents superior performance on the low-resourced Chinese dataset.
- [1115] arXiv:2402.15810 (replaced) [pdf, html, other]
-
Title: OAG-Bench: A Human-Curated Benchmark for Academic Graph MiningFanjin Zhang, Shijie Shi, Yifan Zhu, Bo Chen, Yukuo Cen, Jifan Yu, Yelin Chen, Lulu Wang, Qingfei Zhao, Yuqing Cheng, Tianyi Han, Yuwei An, Dan Zhang, Weng Lam Tam, Kun Cao, Yunhe Pang, Xinyu Guan, Huihui Yuan, Jian Song, Xiaoyan Li, Yuxiao Dong, Jie TangComments: KDD'24, 9 pages, 5 appendix pagesJournal-ref: Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD '24), August 25--29, 2024, Barcelona, SpainSubjects: Digital Libraries (cs.DL); Computation and Language (cs.CL); Machine Learning (cs.LG)
With the rapid proliferation of scientific literature, versatile academic knowledge services increasingly rely on comprehensive academic graph mining. Despite the availability of public academic graphs, benchmarks, and datasets, these resources often fall short in multi-aspect and fine-grained annotations, are constrained to specific task types and domains, or lack underlying real academic graphs. In this paper, we present OAG-Bench, a comprehensive, multi-aspect, and fine-grained human-curated benchmark based on the Open Academic Graph (OAG). OAG-Bench covers 10 tasks, 20 datasets, 70+ baselines, and 120+ experimental results to date. We propose new data annotation strategies for certain tasks and offer a suite of data pre-processing codes, algorithm implementations, and standardized evaluation protocols to facilitate academic graph mining. Extensive experiments reveal that even advanced algorithms like large language models (LLMs) encounter difficulties in addressing key challenges in certain tasks, such as paper source tracing and scholar profiling. We also introduce the Open Academic Graph Challenge (OAG-Challenge) to encourage community input and sharing. We envisage that OAG-Bench can serve as a common ground for the community to evaluate and compare algorithms in academic graph mining, thereby accelerating algorithm development and advancement in this field. OAG-Bench is accessible at this https URL.
- [1116] arXiv:2402.15921 (replaced) [pdf, html, other]
-
Title: Pretraining Strategy for Neural PotentialsSubjects: Machine Learning (cs.LG); Chemical Physics (physics.chem-ph)
We propose a mask pretraining method for Graph Neural Networks (GNNs) to improve their performance on fitting potential energy surfaces, particularly in water systems. GNNs are pretrained by recovering spatial information related to masked-out atoms from molecules, then transferred and finetuned on atomic forcefields. Through such pretraining, GNNs learn meaningful prior about structural and underlying physical information of molecule systems that are useful for downstream tasks. From comprehensive experiments and ablation studies, we show that the proposed method improves the accuracy and convergence speed compared to GNNs trained from scratch or using other pretraining techniques such as denoising. On the other hand, our pretraining method is suitable for both energy-centric and force-centric GNNs. This approach showcases its potential to enhance the performance and data efficiency of GNNs in fitting molecular force fields.
- [1117] arXiv:2402.16159 (replaced) [pdf, html, other]
-
Title: DistALANER: Distantly Supervised Active Learning Augmented Named Entity Recognition in the Open Source Software EcosystemComments: Accepted at ECML-PKDD 2024 (Long Paper)Subjects: Computation and Language (cs.CL)
With the AI revolution in place, the trend for building automated systems to support professionals in different domains such as the open source software systems, healthcare systems, banking systems, transportation systems and many others have become increasingly prominent. A crucial requirement in the automation of support tools for such systems is the early identification of named entities, which serves as a foundation for developing specialized functionalities. However, due to the specific nature of each domain, different technical terminologies and specialized languages, expert annotation of available data becomes expensive and challenging. In light of these challenges, this paper proposes a novel named entity recognition (NER) technique specifically tailored for the open-source software systems. Our approach aims to address the scarcity of annotated software data by employing a comprehensive two-step distantly supervised annotation process. This process strategically leverages language heuristics, unique lookup tables, external knowledge sources, and an active learning approach. By harnessing these powerful techniques, we not only enhance model performance but also effectively mitigate the limitations associated with cost and the scarcity of expert annotators. It is noteworthy that our model significantly outperforms the state-of-the-art LLMs by a substantial margin. We also show the effectiveness of NER in the downstream task of relation extraction.
- [1118] arXiv:2402.16602 (replaced) [pdf, html, other]
-
Title: Rethinking Negative Instances for Generative Named Entity RecognitionComments: ACL 2024 FindingsSubjects: Computation and Language (cs.CL)
Large Language Models (LLMs) have demonstrated impressive capabilities for generalizing in unseen tasks. In the Named Entity Recognition (NER) task, recent advancements have seen the remarkable improvement of LLMs in a broad range of entity domains via instruction tuning, by adopting entity-centric schema. In this work, we explore the potential enhancement of the existing methods by incorporating negative instances into training. Our experiments reveal that negative instances contribute to remarkable improvements by (1) introducing contextual information, and (2) clearly delineating label boundaries. Furthermore, we introduce an efficient longest common subsequence (LCS) matching algorithm, which is tailored to transform unstructured predictions into structured entities. By integrating these components, we present GNER, a Generative NER system that shows improved zero-shot performance across unseen entity domains. Our comprehensive evaluation illustrates our system's superiority, surpassing state-of-the-art (SoTA) methods by 9 $F_1$ score in zero-shot evaluation.
- [1119] arXiv:2402.17387 (replaced) [pdf, html, other]
-
Title: RACP: Risk-Aware Contingency Planning with Multi-Modal PredictionsComments: Accepted at IEEE Transactions on Intelligent Vehicles (T-IV)Subjects: Robotics (cs.RO)
For an autonomous vehicle to operate reliably within real-world traffic scenarios, it is imperative to assess the repercussions of its prospective actions by anticipating the uncertain intentions exhibited by other participants in the traffic environment. Driven by the pronounced multi-modal nature of human driving behavior, this paper presents an approach that leverages Bayesian beliefs over the distribution of potential policies of other road users to construct a novel risk-aware probabilistic motion planning framework. In particular, we propose a novel contingency planner that outputs long-term contingent plans conditioned on multiple possible intents for other actors in the traffic scene. The Bayesian belief is incorporated into the optimization cost function to influence the behavior of the short-term plan based on the likelihood of other agents' policies. Furthermore, a probabilistic risk metric is employed to fine-tune the balance between efficiency and robustness. Through a series of closed-loop safety-critical simulated traffic scenarios shared with human-driven vehicles, we demonstrate the practical efficacy of our proposed approach that can handle multi-vehicle scenarios.
- [1120] arXiv:2402.17954 (replaced) [pdf, html, other]
-
Title: Twists, Humps, and Pebbles: Multilingual Speech Recognition Models Exhibit Gender Performance GapsComments: 23 pages. Code and artifacts at this https URLSubjects: Computation and Language (cs.CL)
Current automatic speech recognition (ASR) models are designed to be used across many languages and tasks without substantial changes. However, this broad language coverage hides performance gaps within languages, for example, across genders. Our study systematically evaluates the performance of two widely used multilingual ASR models on three datasets, encompassing 19 languages from eight language families and two speaking conditions. Our findings reveal clear gender disparities, with the advantaged group varying across languages and models. Surprisingly, those gaps are not explained by acoustic or lexical properties. However, probing internal model states reveals a correlation with gendered performance gap. I.e., the easier it is to distinguish speaker gender in a language using probes, the more the gap reduces, favoring female speakers. Our results show that gender disparities persist even in state-of-the-art models. Our findings have implications for the improvement of multilingual ASR systems, underscoring the importance of accessibility to training data and nuanced evaluation to predict and mitigate gender gaps. We release all code and artifacts at this https URL.
- [1121] arXiv:2402.18129 (replaced) [pdf, html, other]
-
Title: On the Inductive Biases of Demographic Parity-based Fair Learning AlgorithmsSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Theory (cs.IT)
Fair supervised learning algorithms assigning labels with little dependence on a sensitive attribute have attracted great attention in the machine learning community. While the demographic parity (DP) notion has been frequently used to measure a model's fairness in training fair classifiers, several studies in the literature suggest potential impacts of enforcing DP in fair learning algorithms. In this work, we analytically study the effect of standard DP-based regularization methods on the conditional distribution of the predicted label given the sensitive attribute. Our analysis shows that an imbalanced training dataset with a non-uniform distribution of the sensitive attribute could lead to a classification rule biased toward the sensitive attribute outcome holding the majority of training data. To control such inductive biases in DP-based fair learning, we propose a sensitive attribute-based distributionally robust optimization (SA-DRO) method improving robustness against the marginal distribution of the sensitive attribute. Finally, we present several numerical results on the application of DP-based learning methods to standard centralized and distributed learning problems. The empirical findings support our theoretical results on the inductive biases in DP-based fair learning algorithms and the debiasing effects of the proposed SA-DRO method.
- [1122] arXiv:2402.18213 (replaced) [pdf, other]
-
Title: Multi-objective Differentiable Neural Architecture SearchComments: 37 pages, 27 figuresSubjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
Pareto front profiling in multi-objective optimization (MOO), i.e. finding a diverse set of Pareto optimal solutions, is challenging, especially with expensive objectives like neural network training. Typically, in MOO neural architecture search (NAS), we aim to balance performance and hardware metrics across devices. Prior NAS approaches simplify this task by incorporating hardware constraints into the objective function, but profiling the Pareto front necessitates a computationally expensive search for each constraint. In this work, we propose a novel NAS algorithm that encodes user preferences for the trade-off between performance and hardware metrics, and yields representative and diverse architectures across multiple devices in just one search run. To this end, we parameterize the joint architectural distribution across devices and multiple objectives via a hypernetwork that can be conditioned on hardware features and preference vectors, enabling zero-shot transferability to new devices. Extensive experiments with up to 19 hardware devices and 3 objectives showcase the effectiveness and scalability of our method. Finally, we show that, without extra costs, our method outperforms existing MOO NAS methods across a broad range of qualitatively different search spaces and datasets, including MobileNetV3 on ImageNet-1k, an encoder-decoder transformer space for machine translation and a decoder-only transformer space for language modelling.
- [1123] arXiv:2402.18439 (replaced) [pdf, html, other]
-
Title: Beyond Natural Language: LLMs Leveraging Alternative Formats for Enhanced Reasoning and CommunicationWeize Chen, Chenfei Yuan, Jiarui Yuan, Yusheng Su, Chen Qian, Cheng Yang, Ruobing Xie, Zhiyuan Liu, Maosong SunComments: Code release at this https URLSubjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Natural language (NL) has long been the predominant format for human cognition and communication, and by extension, has been similarly pivotal in the development and application of Large Language Models (LLMs). Yet, besides NL, LLMs have seen various non-NL formats during pre-training, such as code and logical expression. NL's status as the optimal format for LLMs, particularly in single-LLM reasoning and multi-agent communication, has not been thoroughly examined. In this work, we challenge the default use of NL by exploring the utility of non-NL formats in these contexts. We show that allowing LLMs to autonomously select the most suitable format before reasoning or communicating leads to a 3.3 to 5.7\% improvement in reasoning efficiency for different LLMs, and up to a 72.7\% reduction in token usage in multi-agent communication, all while maintaining communicative effectiveness. Our comprehensive analysis further reveals that LLMs can devise a format from limited task instructions and that the devised format is effectively transferable across different LLMs. Intriguingly, the structured communication format decided by LLMs exhibits notable parallels with established agent communication languages, suggesting a natural evolution towards efficient, structured communication in agent communication. Our code is released at \url{this https URL}.
- [1124] arXiv:2402.19163 (replaced) [pdf, html, other]
-
Title: Decoupled Subgraph Federated LearningComments: Updated version. Main changes: 1. Title; 2. Added discussion on communication complexity and a pruned version of our framework; 3. Focused on our general framework for the scenario where server lacks knowledge of global graph connections, and discussed the scenario with complete knowledge in Appendix; 4. Compared FedStruct with FedPub; 5. Included results using FedStar and for federated averagingSubjects: Machine Learning (cs.LG); Information Theory (cs.IT)
We address the challenge of federated learning on graph-structured data distributed across multiple clients. Specifically, we focus on the prevalent scenario of interconnected subgraphs, where interconnections between different clients play a critical role. We present a novel framework for this scenario, named FedStruct, that harnesses deep structural dependencies. To uphold privacy, unlike existing methods, FedStruct eliminates the necessity of sharing or generating sensitive node features or embeddings among clients. Instead, it leverages explicit global graph structure information to capture inter-node dependencies. We validate the effectiveness of FedStruct through experimental results conducted on six datasets for semi-supervised node classification, showcasing performance close to the centralized approach across various scenarios, including different data partitioning methods, varying levels of label availability, and number of clients.
- [1125] arXiv:2402.19325 (replaced) [pdf, html, other]
-
Title: Do End-to-End Neural Diarization Attractors Need to Encode Speaker Characteristic Information?Comments: Accepted to Odyssey 2024. This arXiv version includes an appendix for more visualizations. Code: this https URLSubjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
In this paper, we apply the variational information bottleneck approach to end-to-end neural diarization with encoder-decoder attractors (EEND-EDA). This allows us to investigate what information is essential for the model. EEND-EDA utilizes attractors, vector representations of speakers in a conversation. Our analysis shows that, attractors do not necessarily have to contain speaker characteristic information. On the other hand, giving the attractors more freedom to allow them to encode some extra (possibly speaker-specific) information leads to small but consistent diarization performance improvements. Despite architectural differences in EEND systems, the notion of attractors and frame embeddings is common to most of them and not specific to EEND-EDA. We believe that the main conclusions of this work can apply to other variants of EEND. Thus, we hope this paper will be a valuable contribution to guide the community to make more informed decisions when designing new systems.
- [1126] arXiv:2403.00514 (replaced) [pdf, html, other]
-
Title: Overestimation, Overfitting, and Plasticity in Actor-Critic: the Bitter Lesson of Reinforcement LearningComments: ICML 2024Subjects: Machine Learning (cs.LG)
Recent advancements in off-policy Reinforcement Learning (RL) have significantly improved sample efficiency, primarily due to the incorporation of various forms of regularization that enable more gradient update steps than traditional agents. However, many of these techniques have been tested in limited settings, often on tasks from single simulation benchmarks and against well-known algorithms rather than a range of regularization approaches. This limits our understanding of the specific mechanisms driving RL improvements. To address this, we implemented over 60 different off-policy agents, each integrating established regularization techniques from recent state-of-the-art algorithms. We tested these agents across 14 diverse tasks from 2 simulation benchmarks, measuring training metrics related to overestimation, overfitting, and plasticity loss -- issues that motivate the examined regularization techniques. Our findings reveal that while the effectiveness of a specific regularization setup varies with the task, certain combinations consistently demonstrate robust and superior performance. Notably, a simple Soft Actor-Critic agent, appropriately regularized, reliably finds a better-performing policy within the training regime, which previously was achieved mainly through model-based approaches.
- [1127] arXiv:2403.00863 (replaced) [pdf, html, other]
-
Title: LLM-Ensemble: Optimal Large Language Model Ensemble Method for E-commerce Product Attribute Value ExtractionChenhao Fang, Xiaohan Li, Zezhong Fan, Jianpeng Xu, Kaushiki Nag, Evren Korpeoglu, Sushant Kumar, Kannan AchanComments: SIGIR 2024 industry trackSubjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Product attribute value extraction is a pivotal component in Natural Language Processing (NLP) and the contemporary e-commerce industry. The provision of precise product attribute values is fundamental in ensuring high-quality recommendations and enhancing customer satisfaction. The recently emerging Large Language Models (LLMs) have demonstrated state-of-the-art performance in numerous attribute extraction tasks, without the need for domain-specific training data. Nevertheless, varying strengths and weaknesses are exhibited by different LLMs due to the diversity in data, architectures, and hyperparameters. This variation makes them complementary to each other, with no single LLM dominating all others. Considering the diverse strengths and weaknesses of LLMs, it becomes necessary to develop an ensemble method that leverages their complementary potentials. In this paper, we propose a novel algorithm called LLM-ensemble to ensemble different LLMs' outputs for attribute value extraction. We iteratively learn the weights for different LLMs to aggregate the labels with weights to predict the final attribute value. Not only can our proposed method be proven theoretically optimal, but it also ensures efficient computation, fast convergence, and safe deployment. We have also conducted extensive experiments with various state-of-the-art LLMs, including Llama2-13B, Llama2-70B, PaLM-2, GPT-3.5, and GPT-4, on Walmart's internal data. Our offline metrics demonstrate that the LLM-ensemble method outperforms all the state-of-the-art single LLMs on Walmart's internal dataset. This method has been launched in several production models, leading to improved Gross Merchandise Volume (GMV), Click-Through Rate (CTR), Conversion Rate (CVR), and Add-to-Cart Rate (ATC).
- [1128] arXiv:2403.01922 (replaced) [pdf, html, other]
-
Title: FlowPrecision: Advancing FPGA-Based Real-Time Fluid Flow Estimation with Linear QuantizationComments: 6 pages, 3 figures, The 22nd International Conference on Pervasive Computing and Communications (PerCom 2024), PerConAI WorkshopSubjects: Machine Learning (cs.LG); Fluid Dynamics (physics.flu-dyn)
In industrial and environmental monitoring, achieving real-time and precise fluid flow measurement remains a critical challenge. This study applies linear quantization in FPGA-based soft sensors for fluid flow estimation, significantly enhancing Neural Network model precision by overcoming the limitations of traditional fixed-point quantization. Our approach achieves up to a 10.10% reduction in Mean Squared Error and a notable 9.39% improvement in inference speed through targeted hardware optimizations. Validated across multiple data sets, our findings demonstrate that the optimized FPGA-based quantized models can provide efficient, accurate real-time inference, offering a viable alternative to cloud-based processing in pervasive autonomous systems.
- [1129] arXiv:2403.02347 (replaced) [pdf, html, other]
-
Title: On the Convergence of Federated Learning Algorithms without Data SimilarityComments: Accepted by the IEEE Transactions on Big Data JournalSubjects: Machine Learning (cs.LG); Computer Science and Game Theory (cs.GT)
Data similarity assumptions have traditionally been relied upon to understand the convergence behaviors of federated learning methods. Unfortunately, this approach often demands fine-tuning step sizes based on the level of data similarity. When data similarity is low, these small step sizes result in an unacceptably slow convergence speed for federated methods. In this paper, we present a novel and unified framework for analyzing the convergence of federated learning algorithms without the need for data similarity conditions. Our analysis centers on an inequality that captures the influence of step sizes on algorithmic convergence performance. By applying our theorems to well-known federated algorithms, we derive precise expressions for three widely used step size schedules: fixed, diminishing, and step-decay step sizes, which are independent of data similarity conditions. Finally, we conduct comprehensive evaluations of the performance of these federated learning algorithms, employing the proposed step size strategies to train deep neural network models on benchmark datasets under varying data similarity conditions. Our findings demonstrate significant improvements in convergence speed and overall performance, marking a substantial advancement in federated learning research.
- [1130] arXiv:2403.02966 (replaced) [pdf, html, other]
-
Title: Evidence-Focused Fact Summarization for Knowledge-Augmented Zero-Shot Question AnsweringSubjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Recent studies have investigated utilizing Knowledge Graphs (KGs) to enhance Quesetion Answering (QA) performance of Large Language Models (LLMs), yet structured KG verbalization remains challengin. Existing methods, such as triple-form or free-form textual conversion of triple-form facts, encounter several issues. These include reduced evidence density due to duplicated entities or relationships, and reduced evidence clarity due to an inability to emphasize crucial evidence. To address these issues, we propose EFSum, an Evidence-focused Fact Summarization framework for enhanced QA with knowledge-augmented LLMs. We optimize an open-source LLM as a fact summarizer through distillation and preference alignment. Our extensive experiments show that EFSum improves LLM's zero-shot QA performance, and it is possible to ensure both the helpfulness and faithfulness of the summary.
- [1131] arXiv:2403.04453 (replaced) [pdf, html, other]
-
Title: Vlearn: Off-Policy Learning with Efficient State-Value Function EstimationSubjects: Machine Learning (cs.LG)
Existing off-policy reinforcement learning algorithms often rely on an explicit state-action-value function representation, which can be problematic in high-dimensional action spaces due to the curse of dimensionality. This reliance results in data inefficiency as maintaining a state-action-value function in such spaces is challenging. We present an efficient approach that utilizes only a state-value function as the critic for off-policy deep reinforcement learning. This approach, which we refer to as Vlearn, effectively circumvents the limitations of existing methods by eliminating the necessity for an explicit state-action-value function. To this end, we introduce a novel importance sampling loss for learning deep value functions from off-policy data. While this is common for linear methods, it has not been combined with deep value function networks. This transfer to deep methods is not straightforward and requires novel design choices such as robust policy updates, twin value function networks to avoid an optimization bias, and importance weight clipping. We also present a novel analysis of the variance of our estimate compared to commonly used importance sampling estimators such as V-trace. Our approach improves sample complexity as well as final performance and ensures consistent and robust performance across various benchmark tasks. Eliminating the state-action-value function in Vlearn facilitates a streamlined learning process, enabling more effective exploration and exploitation in complex environments.
- [1132] arXiv:2403.05286 (replaced) [pdf, html, other]
-
Title: LLM4Decompile: Decompiling Binary Code with Large Language ModelsSubjects: Programming Languages (cs.PL); Computation and Language (cs.CL)
Decompilation aims to convert binary code to high-level source code, but traditional tools like Ghidra often produce results that are difficult to read and execute. Motivated by the advancements in Large Language Models (LLMs), we propose LLM4Decompile, the first and largest open-source LLM series (1.3B to 33B) trained to decompile binary code. We optimize the LLM training process and introduce the LLM4Decompile-End models to decompile binary directly. The resulting models significantly outperform GPT-4o and Ghidra on the HumanEval and ExeBench benchmarks by over 100%. Additionally, we improve the standard refinement approach to fine-tune the LLM4Decompile-Ref models, enabling them to effectively refine the decompiled code from Ghidra and achieve a further 16.2% improvement over the LLM4Decompile-End. LLM4Decompile demonstrates the potential of LLMs to revolutionize binary code decompilation, delivering remarkable improvements in readability and executability while complementing conventional tools for optimal results. Our code, dataset, and models are released at this https URL
- [1133] arXiv:2403.05750 (replaced) [pdf, html, other]
-
Title: Decoding the AI Pen: Techniques and Challenges in Detecting AI-Generated TextSubjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Large Language Models (LLMs) have revolutionized the field of Natural Language Generation (NLG) by demonstrating an impressive ability to generate human-like text. However, their widespread usage introduces challenges that necessitate thoughtful examination, ethical scrutiny, and responsible practices. In this study, we delve into these challenges, explore existing strategies for mitigating them, with a particular emphasis on identifying AI-generated text as the ultimate solution. Additionally, we assess the feasibility of detection from a theoretical perspective and propose novel research directions to address the current limitations in this domain.
- [1134] arXiv:2403.06131 (replaced) [pdf, html, other]
-
Title: FewFedPIT: Towards Privacy-preserving and Few-shot Federated Instruction TuningZhuo Zhang, Jingyuan Zhang, Jintao Huang, Lizhen Qu, Hongzhi Zhang, Qifan Wang, Xun Zhou, Zenglin XuComments: Work in progressSubjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Instruction tuning has been identified as a crucial technique for optimizing the performance of large language models (LLMs) in generating human-aligned responses. Nonetheless, gathering diversified and superior-quality instruction data for such tuning presents notable obstacles, especially in domains with rigid privacy provisions. Federated instruction tuning (FedIT) has emerged as a promising solution, by consolidating collaborative training across multiple data owners, thereby resulting in a privacy-preserving learning model. However, FedIT encounters limitations such as scarcity of instructional data and risk of exposure to training data extraction attacks. In this paper, we propose a novel federated algorithm, FewFedPIT, designed to simultaneously enhance privacy protection and model performance of federated few-shot learning. FewFedPITcomprises three vital components on the client side: (1) synthetic data generation, which utilizes LLMs' in-context learning capacity to generate synthetic data autonomously, thus expanding the local database; (2) parameter isolation training, which individually updates the public parameters in the synthetic data and the private parameters in the local data, consequently mitigating the noise impact of the synthetic data; (3) local aggregation sharing, which mixes public and private parameters before uploading, effectively preventing data extraction attacks. Extensive experiments on three open-source datasets demonstrate the effectiveness of FewFedPITin, enhancing privacy preservation and improving federated few-shot performance.
- [1135] arXiv:2403.06398 (replaced) [pdf, html, other]
-
Title: On the Diminishing Returns of Width for Continual LearningComments: 25 pages. ICML 2024Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
While deep neural networks have demonstrated groundbreaking performance in various settings, these models often suffer from \emph{catastrophic forgetting} when trained on new tasks in sequence. Several works have empirically demonstrated that increasing the width of a neural network leads to a decrease in catastrophic forgetting but have yet to characterize the exact relationship between width and continual learning. We design one of the first frameworks to analyze Continual Learning Theory and prove that width is directly related to forgetting in Feed-Forward Networks (FFN). Specifically, we demonstrate that increasing network widths to reduce forgetting yields diminishing returns. We empirically verify our claims at widths hitherto unexplored in prior studies where the diminishing returns are clearly observed as predicted by our theory.
- [1136] arXiv:2403.07015 (replaced) [pdf, html, other]
-
Title: Adaptive Hyperparameter Optimization for Continual Learning ScenariosSubjects: Machine Learning (cs.LG)
Hyperparameter selection in continual learning scenarios is a challenging and underexplored aspect, especially in practical non-stationary environments. Traditional approaches, such as grid searches with held-out validation data from all tasks, are unrealistic for building accurate lifelong learning systems. This paper aims to explore the role of hyperparameter selection in continual learning and the necessity of continually and automatically tuning them according to the complexity of the task at hand. Hence, we propose leveraging the nature of sequence task learning to improve Hyperparameter Optimization efficiency. By using the functional analysis of variance-based techniques, we identify the most crucial hyperparameters that have an impact on performance. We demonstrate empirically that this approach, agnostic to continual scenarios and strategies, allows us to speed up hyperparameters optimization continually across tasks and exhibit robustness even in the face of varying sequential task orders. We believe that our findings can contribute to the advancement of continual learning methodologies towards more efficient, robust and adaptable models for real-world applications.
- [1137] arXiv:2403.07088 (replaced) [pdf, html, other]
-
Title: SPA: Towards A Computational Friendly Cloud-Base and On-Devices Collaboration Seq2seq Personalized GenerationComments: 15 pages, second version of SPA(Side Plugin Adaption)Subjects: Computation and Language (cs.CL)
Large language models(LLMs) have shown its outperforming ability on various tasks and question answering. However, LLMs require substantial memory storage on low-resource devices. More critically, the computational speed on these devices is also severely limited. In this paper, we propose SPA(Side Plugin Adaption), a lightweight architecture for fast on-devices inference on the constraints of strict on-devices computation and memory constraints. Compared with other on-devices seq2seq generation, SPA could make a fast and stable inference on low-resource constraints, allowing it to obtain cost effiency. Our method establish an interaction between a pretrained LLMs on-cloud and additive parameters on-devices, which could provide the knowledge on both pretrained LLMs and featured personal feature. Further more, SPA provides a framework to keep feature-base parameters on low computational devices while leave the parameters containing general information on the high computational devices.
- [1138] arXiv:2403.07322 (replaced) [pdf, html, other]
-
Title: A Question-centric Multi-experts Contrastive Learning Framework for Improving the Accuracy and Interpretability of Deep Sequential Knowledge Tracing ModelsComments: 25 pages, 9 figures, Accepted by TKDDSubjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Knowledge tracing (KT) plays a crucial role in predicting students' future performance by analyzing their historical learning processes. Deep neural networks (DNNs) have shown great potential in solving the KT problem. However, there still exist some important challenges when applying deep learning techniques to model the KT process. The first challenge lies in taking the individual information of the question into modeling. This is crucial because, despite questions sharing the same knowledge component (KC), students' knowledge acquisition on homogeneous questions can vary significantly. The second challenge lies in interpreting the prediction results from existing deep learning-based KT models. In real-world applications, while it may not be necessary to have complete transparency and interpretability of the model parameters, it is crucial to present the model's prediction results in a manner that teachers find interpretable. This makes teachers accept the rationale behind the prediction results and utilize them to design teaching activities and tailored learning strategies for students. However, the inherent black-box nature of deep learning techniques often poses a hurdle for teachers to fully embrace the model's prediction results. To address these challenges, we propose a Question-centric Multi-experts Contrastive Learning framework for KT called Q-MCKT. We have provided all the datasets and code on our website at this https URL.
- [1139] arXiv:2403.07359 (replaced) [pdf, html, other]
-
Title: FSC: Few-point Shape CompletionComments: Accepted by CVPR 2024Subjects: Computer Vision and Pattern Recognition (cs.CV)
While previous studies have demonstrated successful 3D object shape completion with a sufficient number of points, they often fail in scenarios when a few points, e.g. tens of points, are observed. Surprisingly, via entropy analysis, we find that even a few points, e.g. 64 points, could retain substantial information to help recover the 3D shape of the object. To address the challenge of shape completion with very sparse point clouds, we then propose Few-point Shape Completion (FSC) model, which contains a novel dual-branch feature extractor for handling extremely sparse inputs, coupled with an extensive branch for maximal point utilization with a saliency branch for dynamic importance assignment. This model is further bolstered by a two-stage revision network that refines both the extracted features and the decoder output, enhancing the detail and authenticity of the completed point cloud. Our experiments demonstrate the feasibility of recovering 3D shapes from a few points. The proposed Few-point Shape Completion (FSC) model outperforms previous methods on both few-point inputs and many-point inputs, and shows good generalizability to different object categories.
- [1140] arXiv:2403.07675 (replaced) [pdf, html, other]
-
Title: Multichannel Long-Term Streaming Neural Speech Enhancement for Static and Moving SpeakersComments: Accepted by IEEE Signal Processing LettersSubjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
In this work, we extend our previously proposed offline SpatialNet for long-term streaming multichannel speech enhancement in both static and moving speaker scenarios. SpatialNet exploits spatial information, such as the spatial/steering direction of speech, for discriminating between target speech and interferences, and achieved outstanding performance. The core of SpatialNet is a narrow-band self-attention module used for learning the temporal dynamic of spatial vectors. Towards long-term streaming speech enhancement, we propose to replace the offline self-attention network with online networks that have linear inference complexity w.r.t signal length and meanwhile maintain the capability of learning long-term information. Three variants are developed based on (i) masked self-attention, (ii) Retention, a self-attention variant with linear inference complexity, and (iii) Mamba, a structured-state-space-based RNN-like network. Moreover, we investigate the length extrapolation ability of different networks, namely test on signals that are much longer than training signals, and propose a short-signal training plus long-signal fine-tuning strategy, which largely improves the length extrapolation ability of the networks within limited training time. Overall, the proposed online SpatialNet achieves outstanding speech enhancement performance for long audio streams, and for both static and moving speakers. The proposed method is open-sourced in this https URL.
- [1141] arXiv:2403.07714 (replaced) [pdf, html, other]
-
Title: StableToolBench: Towards Stable Large-Scale Benchmarking on Tool Learning of Large Language ModelsZhicheng Guo, Sijie Cheng, Hao Wang, Shihao Liang, Yujia Qin, Peng Li, Zhiyuan Liu, Maosong Sun, Yang LiuSubjects: Computation and Language (cs.CL)
Large Language Models (LLMs) have witnessed remarkable advancements in recent years, prompting the exploration of tool learning, which integrates LLMs with external tools to address diverse real-world challenges. Assessing the capability of LLMs to utilise tools necessitates large-scale and stable benchmarks. However, previous works relied on either hand-crafted online tools with limited scale, or large-scale real online APIs suffering from instability of API status. To address this problem, we introduce StableToolBench, a benchmark evolving from ToolBench, proposing a virtual API server and stable evaluation system. The virtual API server contains a caching system and API simulators which are complementary to alleviate the change in API status. Meanwhile, the stable evaluation system designs solvable pass and win rates using GPT-4 as the automatic evaluator to eliminate the randomness during evaluation. Experimental results demonstrate the stability of StableToolBench, and further discuss the effectiveness of API simulators, the caching system, and the evaluator system.
- [1142] arXiv:2403.07794 (replaced) [pdf, html, other]
-
Title: SIT: Fine-tuning Large Language Models with Sequential InstructionsComments: 21pages, 8 figuresSubjects: Computation and Language (cs.CL)
Despite the success of existing instruction-tuned models, we find that they usually struggle to respond to queries with multiple instructions. This impairs their performance in complex problems whose solution consists of multiple intermediate tasks. Thus, we contend that part of the fine-tuning data mixture should be sequential--containing a chain of interrelated tasks. We first approach sequential instruction tuning from a task-driven perspective, manually creating interpretable intermediate tasks for multilingual and visual question answering: namely "translate then predict" and "caption then answer". Next, we automate this process by turning instructions in existing datasets (e.g., Alpaca and FlanCoT) into diverse and complex sequential instructions, making our method general-purpose. Models that underwent our sequential instruction tuning show improved results in coding, maths, and open-ended generation. Moreover, we put forward a new benchmark named SeqEval to evaluate a model's ability to follow all the instructions in a sequence, which further corroborates the benefits of our fine-tuning method. We hope that our endeavours will open new research avenues on instruction tuning for complex tasks.
- [1143] arXiv:2403.08010 (replaced) [pdf, html, other]
-
Title: Debatrix: Multi-dimensional Debate Judge with Iterative Chronological Analysis Based on LLMSubjects: Computation and Language (cs.CL)
How can we construct an automated debate judge to evaluate an extensive, vibrant, multi-turn debate? This task is challenging, as judging a debate involves grappling with lengthy texts, intricate argument relationships, and multi-dimensional assessments. At the same time, current research mainly focuses on short dialogues, rarely touching upon the evaluation of an entire debate. In this paper, by leveraging Large Language Models (LLMs), we propose Debatrix, which makes the analysis and assessment of multi-turn debates more aligned with majority preferences. Specifically, Debatrix features a vertical, iterative chronological analysis and a horizontal, multi-dimensional evaluation collaboration. To align with real-world debate scenarios, we introduced the PanelBench benchmark, comparing our system's performance to actual debate outcomes. The findings indicate a notable enhancement over directly using LLMs for debate evaluation. Source code and benchmark data are available online at this https URL .
- [1144] arXiv:2403.08290 (replaced) [pdf, html, other]
-
Title: Fast wavefield evaluation method based on modified proxy-surface-accelerated interpolative decomposition for two-dimensional scattering problemsSubjects: Numerical Analysis (math.NA)
This paper presents a fast wavefield evaluation method for two-dimensional wave scattering problems. The proposed method is based on a modified version of proxy-surface-accelerated interpolative decomposition, making it effective even if the evaluation points are near the boundary. The commonly known fast multipole method requires the use of direct evaluations near the boundaries of scatterers because the analytical expansion of kernel functions does not converge. On the one hand, the proposed method does not require the analytical expansion of kernel functions. The validity and effectiveness of the proposed method are demonstrated using numerical examples.
- [1145] arXiv:2403.08974 (replaced) [pdf, html, other]
-
Title: $TrIND$: Representing Anatomical Trees by Denoising Diffusion of Implicit Neural FieldsComments: Accepted to MICCAI 2024Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Anatomical trees play a central role in clinical diagnosis and treatment planning. However, accurately representing anatomical trees is challenging due to their varying and complex topology and geometry. Traditional methods for representing tree structures, captured using medical imaging, while invaluable for visualizing vascular and bronchial networks, exhibit drawbacks in terms of limited resolution, flexibility, and efficiency. Recently, implicit neural representations (INRs) have emerged as a powerful tool for representing shapes accurately and efficiently. We propose a novel approach, $TrIND$, for representing anatomical trees using INR, while also capturing the distribution of a set of trees via denoising diffusion in the space of INRs. We accurately capture the intricate geometries and topologies of anatomical trees at any desired resolution. Through extensive qualitative and quantitative evaluation, we demonstrate high-fidelity tree reconstruction with arbitrary resolution yet compact storage, and versatility across anatomical sites and tree complexities. The code is available at: \texttt{\url{this https URL}}.
- [1146] arXiv:2403.09502 (replaced) [pdf, html, other]
-
Title: EquiAV: Leveraging Equivariance for Audio-Visual Contrastive LearningComments: 15 pages, 3 figures; Accepted to ICML 2024 (camera ready version)Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
Recent advancements in self-supervised audio-visual representation learning have demonstrated its potential to capture rich and comprehensive representations. However, despite the advantages of data augmentation verified in many learning methods, audio-visual learning has struggled to fully harness these benefits, as augmentations can easily disrupt the correspondence between input pairs. To address this limitation, we introduce EquiAV, a novel framework that leverages equivariance for audio-visual contrastive learning. Our approach begins with extending equivariance to audio-visual learning, facilitated by a shared attention-based transformation predictor. It enables the aggregation of features from diverse augmentations into a representative embedding, providing robust supervision. Notably, this is achieved with minimal computational overhead. Extensive ablation studies and qualitative results verify the effectiveness of our method. EquiAV outperforms previous works across various audio-visual benchmarks. The code is available on this https URL.
- [1147] arXiv:2403.10258 (replaced) [pdf, html, other]
-
Title: Is Translation All You Need? A Study on Solving Multilingual Tasks with Large Language ModelsComments: 19 pagesSubjects: Computation and Language (cs.CL)
Large language models (LLMs) have demonstrated multilingual capabilities; yet, they are mostly English-centric due to the imbalanced training corpora. Existing works leverage this phenomenon to improve their multilingual performances through translation, primarily on natural language processing (NLP) tasks. This work extends the evaluation from NLP tasks to real user queries and from English-centric LLMs to non-English-centric LLMs. While translation into English can help improve the performance of multilingual NLP tasks for English-centric LLMs, it may not be optimal for all scenarios. For culture-related tasks that need deep language understanding, prompting in the native language tends to be more promising as it better captures the nuances of culture and language. Our experiments reveal varied behaviors among different LLMs and tasks in the multilingual context. Therefore, we advocate for more comprehensive multilingual evaluation and more efforts toward developing multilingual LLMs beyond English-centric ones.
- [1148] arXiv:2403.10506 (replaced) [pdf, html, other]
-
Title: HumanoidBench: Simulated Humanoid Benchmark for Whole-Body Locomotion and ManipulationSubjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Humanoid robots hold great promise in assisting humans in diverse environments and tasks, due to their flexibility and adaptability leveraging human-like morphology. However, research in humanoid robots is often bottlenecked by the costly and fragile hardware setups. To accelerate algorithmic research in humanoid robots, we present a high-dimensional, simulated robot learning benchmark, HumanoidBench, featuring a humanoid robot equipped with dexterous hands and a variety of challenging whole-body manipulation and locomotion tasks. Our findings reveal that state-of-the-art reinforcement learning algorithms struggle with most tasks, whereas a hierarchical learning approach achieves superior performance when supported by robust low-level policies, such as walking or reaching. With HumanoidBench, we provide the robotics community with a platform to identify the challenges arising when solving diverse tasks with humanoid robots, facilitating prompt verification of algorithms and ideas. The open-source code is available at this https URL.
- [1149] arXiv:2403.11381 (replaced) [pdf, html, other]
-
Title: Can LLM-Augmented autonomous agents cooperate?, An evaluation of their cooperative capabilities through Melting PotManuel Mosquera, Juan Sebastian Pinzon, Manuel Rios, Yesid Fonseca, Luis Felipe Giraldo, Nicanor Quijano, Ruben ManriqueSubjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
As the field of AI continues to evolve, a significant dimension of this progression is the development of Large Language Models and their potential to enhance multi-agent artificial intelligence systems. This paper explores the cooperative capabilities of Large Language Model-augmented Autonomous Agents (LAAs) using the well-known Meltin Pot environments along with reference models such as GPT4 and GPT3.5. Preliminary results suggest that while these agents demonstrate a propensity for cooperation, they still struggle with effective collaboration in given environments, emphasizing the need for more robust architectures. The study's contributions include an abstraction layer to adapt Melting Pot game scenarios for LLMs, the implementation of a reusable architecture for LLM-mediated agent development - which includes short and long-term memories and different cognitive modules, and the evaluation of cooperation capabilities using a set of metrics tied to the Melting Pot's "Commons Harvest" game. The paper closes, by discussing the limitations of the current architectural framework and the potential of a new set of modules that fosters better cooperation among LAAs.
- [1150] arXiv:2403.12660 (replaced) [pdf, html, other]
-
Title: ERASE: Benchmarking Feature Selection Methods for Deep Recommender SystemsPengyue Jia, Yejing Wang, Zhaocheng Du, Xiangyu Zhao, Yichao Wang, Bo Chen, Wanyu Wang, Huifeng Guo, Ruiming TangComments: Accepted to KDD 2024Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Deep Recommender Systems (DRS) are increasingly dependent on a large number of feature fields for more precise recommendations. Effective feature selection methods are consequently becoming critical for further enhancing the accuracy and optimizing storage efficiencies to align with the deployment demands. This research area, particularly in the context of DRS, is nascent and faces three core challenges. Firstly, variant experimental setups across research papers often yield unfair comparisons, obscuring practical insights. Secondly, the existing literature's lack of detailed analysis on selection attributes, based on large-scale datasets and a thorough comparison among selection techniques and DRS backbones, restricts the generalizability of findings and impedes deployment on DRS. Lastly, research often focuses on comparing the peak performance achievable by feature selection methods, an approach that is typically computationally infeasible for identifying the optimal hyperparameters and overlooks evaluating the robustness and stability of these methods. To bridge these gaps, this paper presents ERASE, a comprehensive bEnchmaRk for feAture SElection for DRS. ERASE comprises a thorough evaluation of eleven feature selection methods, covering both traditional and deep learning approaches, across four public datasets, private industrial datasets, and a real-world commercial platform, achieving significant enhancement. Our code is available online for ease of reproduction.
- [1151] arXiv:2403.13086 (replaced) [pdf, html, other]
-
Title: Listenable Maps for Audio ClassifiersComments: Accepted to ICML 2024 (Oral)Subjects: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS); Signal Processing (eess.SP)
Despite the impressive performance of deep learning models across diverse tasks, their complexity poses challenges for interpretation. This challenge is particularly evident for audio signals, where conveying interpretations becomes inherently difficult. To address this issue, we introduce Listenable Maps for Audio Classifiers (L-MAC), a posthoc interpretation method that generates faithful and listenable interpretations. L-MAC utilizes a decoder on top of a pretrained classifier to generate binary masks that highlight relevant portions of the input audio. We train the decoder with a loss function that maximizes the confidence of the classifier decision on the masked-in portion of the audio while minimizing the probability of model output for the masked-out portion. Quantitative evaluations on both in-domain and out-of-domain data demonstrate that L-MAC consistently produces more faithful interpretations than several gradient and masking-based methodologies. Furthermore, a user study confirms that, on average, users prefer the interpretations generated by the proposed technique.
- [1152] arXiv:2403.13658 (replaced) [pdf, html, other]
-
Title: Multimodal Variational Autoencoder for Low-cost Cardiac Hemodynamics Instability DetectionMohammod N. I. Suvon, Prasun C. Tripathi, Wenrui Fan, Shuo Zhou, Xianyuan Liu, Samer Alabed, Venet Osmani, Andrew J. Swift, Chen Chen, Haiping LuSubjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Recent advancements in non-invasive detection of cardiac hemodynamic instability (CHDI) primarily focus on applying machine learning techniques to a single data modality, e.g. cardiac magnetic resonance imaging (MRI). Despite their potential, these approaches often fall short especially when the size of labeled patient data is limited, a common challenge in the medical domain. Furthermore, only a few studies have explored multimodal methods to study CHDI, which mostly rely on costly modalities such as cardiac MRI and echocardiogram. In response to these limitations, we propose a novel multimodal variational autoencoder ($\text{CardioVAE}_\text{X,G}$) to integrate low-cost chest X-ray (CXR) and electrocardiogram (ECG) modalities with pre-training on a large unlabeled dataset. Specifically, $\text{CardioVAE}_\text{X,G}$ introduces a novel tri-stream pre-training strategy to learn both shared and modality-specific features, thus enabling fine-tuning with both unimodal and multimodal datasets. We pre-train $\text{CardioVAE}_\text{X,G}$ on a large, unlabeled dataset of $50,982$ subjects from a subset of MIMIC database and then fine-tune the pre-trained model on a labeled dataset of $795$ subjects from the ASPIRE registry. Comprehensive evaluations against existing methods show that $\text{CardioVAE}_\text{X,G}$ offers promising performance (AUROC $=0.79$ and Accuracy $=0.77$), representing a significant step forward in non-invasive prediction of CHDI. Our model also excels in producing fine interpretations of predictions directly associated with clinical features, thereby supporting clinical decision-making.
- [1153] arXiv:2403.14084 (replaced) [pdf, other]
-
Title: Learning-based Multi-continuum Model for Multiscale Flow ProblemsComments: Corrected typosSubjects: Numerical Analysis (math.NA); Machine Learning (cs.LG)
Multiscale problems can usually be approximated through numerical homogenization by an equation with some effective parameters that can capture the macroscopic behavior of the original system on the coarse grid to speed up the simulation. However, this approach usually assumes scale separation and that the heterogeneity of the solution can be approximated by the solution average in each coarse block. For complex multiscale problems, the computed single effective properties/continuum might be inadequate. In this paper, we propose a novel learning-based multi-continuum model to enrich the homogenized equation and improve the accuracy of the single continuum model for multiscale problems with some given data. Without loss of generalization, we consider a two-continuum case. The first flow equation keeps the information of the original homogenized equation with an additional interaction term. The second continuum is newly introduced, and the effective permeability in the second flow equation is determined by a neural network. The interaction term between the two continua aligns with that used in the Dual-porosity model but with a learnable coefficient determined by another neural network. The new model with neural network terms is then optimized using trusted data. We discuss both direct back-propagation and the adjoint method for the PDE-constraint optimization problem. Our proposed learning-based multi-continuum model can resolve multiple interacted media within each coarse grid block and describe the mass transfer among them, and it has been demonstrated to significantly improve the simulation results through numerical experiments involving both linear and nonlinear flow equations.
- [1154] arXiv:2403.14370 (replaced) [pdf, other]
-
Title: SyncTweedies: A General Generative Framework Based on Synchronized DiffusionsComments: Project page: this https URLSubjects: Computer Vision and Pattern Recognition (cs.CV)
We introduce a general framework for generating diverse visual content, including ambiguous images, panorama images, mesh textures, and Gaussian splat textures, by synchronizing multiple diffusion processes. We present exhaustive investigation into all possible scenarios for synchronizing multiple diffusion processes through a canonical space and analyze their characteristics across applications. In doing so, we reveal a previously unexplored case: averaging the outputs of Tweedie's formula while conducting denoising in multiple instance spaces. This case also provides the best quality with the widest applicability to downstream tasks. We name this case SyncTweedies. In our experiments generating visual content aforementioned, we demonstrate the superior quality of generation by SyncTweedies compared to other synchronization methods, optimization-based and iterative-update-based methods.
- [1155] arXiv:2403.15412 (replaced) [pdf, html, other]
-
Title: Towards Measuring and Modeling "Culture" in LLMs: A SurveyMuhammad Farid Adilazuarda, Sagnik Mukherjee, Pradhyumna Lavania, Siddhant Singh, Alham Fikri Aji, Jacki O'Neill, Ashutosh Modi, Monojit ChoudhurySubjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
We present a survey of more than 90 recent papers that aim to study cultural representation and inclusion in large language models (LLMs). We observe that none of the studies explicitly define "culture, which is a complex, multifaceted concept; instead, they probe the models on some specially designed datasets which represent certain aspects of "culture". We call these aspects the proxies of culture, and organize them across two dimensions of demographic and semantic proxies. We also categorize the probing methods employed. Our analysis indicates that only certain aspects of ``culture,'' such as values and objectives, have been studied, leaving several other interesting and important facets, especially the multitude of semantic domains (Thompson et al., 2020) and aboutness (Hershcovich et al., 2022), unexplored. Two other crucial gaps are the lack of robustness of probing techniques and situated studies on the impact of cultural mis- and under-representation in LLM-based applications.
- [1156] arXiv:2403.15879 (replaced) [pdf, html, other]
-
Title: TrustSQL: Benchmarking Text-to-SQL Reliability with Penalty-Based ScoringComments: under reviewSubjects: Artificial Intelligence (cs.AI)
Text-to-SQL enables users to interact with databases using natural language, simplifying the retrieval and synthesis of information. Despite the remarkable success of large language models (LLMs) in translating natural language questions into SQL queries, widespread deployment remains limited due to two primary challenges. First, the effective use of text-to-SQL models depends on users' understanding of the model's capabilities-the scope of questions the model can correctly answer. Second, the absence of abstention mechanisms can lead to incorrect SQL generation going unnoticed, thereby undermining trust in the model's output. To enable wider deployment, it is crucial to address these challenges in model design and enhance model evaluation to build trust in the model's output. To this end, we introduce TrustSQL, a novel comprehensive benchmark designed to evaluate text-to-SQL reliability-defined as a model's ability to correctly handle any type of input question by generating correct SQL queries for feasible questions and abstaining from generating infeasible ones (e.g., due to schema incompatibility or functionalities beyond SQL). We evaluate existing methods using a novel penalty-based scoring metric with two modeling approaches: (1) pipeline-based methods combining SQL generators with infeasible question detectors and SQL error detectors for abstention; and (2) unified methods using a single model for the entire task. Our experimental results reveal that achieving high scores under severe penalties requires significant effort and provide a new perspective on developing text-to-SQL models for safer deployment. TrustSQL is available at this https URL.
- [1157] arXiv:2403.15901 (replaced) [pdf, html, other]
-
Title: MatchSeg: Towards Better Segmentation via Reference Image MatchingSubjects: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Recently, automated medical image segmentation methods based on deep learning have achieved great success. However, they heavily rely on large annotated datasets, which are costly and time-consuming to acquire. Few-shot learning aims to overcome the need for annotated data by using a small labeled dataset, known as a support set, to guide predicting labels for new, unlabeled images, known as the query set. Inspired by this paradigm, we introduce MatchSeg, a novel framework that enhances medical image segmentation through strategic reference image matching. We leverage contrastive language-image pre-training (CLIP) to select highly relevant samples when defining the support set. Additionally, we design a joint attention module to strengthen the interaction between support and query features, facilitating a more effective knowledge transfer between support and query sets. We validated our method across four public datasets. Experimental results demonstrate superior segmentation performance and powerful domain generalization ability of MatchSeg against existing methods for domain-specific and cross-domain segmentation tasks. Our code is made available at this https URL
- [1158] arXiv:2403.17209 (replaced) [pdf, other]
-
Title: Generation of Asset Administration Shell with Large Language Model Agents: Towards Semantic Interoperability in Digital Twins in the Context of Industry 4.0Comments: accepted in journal IEEE ACCESS this https URLSubjects: Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Multiagent Systems (cs.MA); Software Engineering (cs.SE)
This research introduces a novel approach for achieving semantic interoperability in digital twins and assisting the creation of Asset Administration Shell (AAS) as digital twin model within the context of Industry 4.0. The foundational idea of our research is that the communication based on semantics and the generation of meaningful textual data are directly linked, and we posit that these processes are equivalent if the exchanged information can be serialized in text form. Based on this, we construct a "semantic node" data structure in our research to capture the semantic essence of textual data. Then, a system powered by large language models is designed and implemented to process the "semantic node" and generate standardized digital twin models from raw textual data collected from datasheets describing technical assets. Our evaluation demonstrates an effective generation rate of 62-79%, indicating a substantial proportion of the information from the source text can be translated error-free to the target digital twin instance model with the generative capability of large language models. This result has a direct application in the context of Industry 4.0, and the designed system is implemented as a data model generation tool for reducing the manual effort in creating AAS model. In our evaluation, a comparative analysis of different LLMs and an in-depth ablation study of Retrieval-Augmented Generation (RAG) mechanisms provide insights into the effectiveness of LLM systems for interpreting technical concepts and translating data. Our findings emphasize LLMs' capability to automate AAS instance creation and contribute to the broader field of semantic interoperability for digital twins in industrial applications. The prototype implementation and evaluation results are presented on our GitHub Repository: this https URL.
- [1159] arXiv:2403.17371 (replaced) [pdf, html, other]
-
Title: Robust Containment Queries over Collections of Rational Parametric Curves via Generalized Winding NumbersComments: 14 pages, 18 figuresSubjects: Computational Geometry (cs.CG); Graphics (cs.GR); Numerical Analysis (math.NA)
Point containment queries for regions bound by watertight geometric surfaces, i.e., closed and without self-intersections, can be evaluated straightforwardly with a number of well-studied algorithms. When this assumption on domain geometry is not met, such methods are either unusable, or prone to misclassifications that can lead to cascading errors in downstream applications. More robust point classification schemes based on generalized winding numbers have been proposed, as they are indifferent to these imperfections. However, existing algorithms are limited to point clouds and collections of linear elements. We extend this methodology to encompass more general curved shapes with an algorithm that evaluates the winding number scalar field over unstructured collections of rational parametric curves. In particular, we evaluate the winding number for each curve independently, making the derived containment query robust to how the curves are arranged. We ensure geometric fidelity in our queries by treating each curve as equivalent to an adaptively constructed polyline that provably has the same generalized winding number at the point of interest. Our algorithm is numerically stable for points that are arbitrarily close to the model, and explicitly treats points that are coincident with curves. We demonstrate the improvements in computational performance granted by this method over conventional techniques as well as the robustness induced by its application.
- [1160] arXiv:2403.18111 (replaced) [pdf, html, other]
-
Title: Scrolly2Reel: Retargeting Graphics for Social Media Using Narrative BeatsComments: 9 pages, 3 figuresSubjects: Human-Computer Interaction (cs.HC)
Content retargeting is crucial for social media creators. Once great content is created, it is important to reach as broad an audience as possible. This is particularly important in journalism where younger audiences are shifting away from print and towards short-video platforms. Many newspapers already create rich graphics for the web that they want to be able to reuse for social media. One example is scrollytelling sequences or "scrollies" -- immersive articles with graphics like animation, charts, and 3D visualizations that appear as a user scrolls. We present a system that helps transform scrollies into social media videos. By using the scriptwriting concept of narrative beats to extract fundamental storytelling units, we can create videos that are more aligned with narration, and allow for better pacing and stylistic changes. Narrative beats are thus an important primitive to retargeting content that matches the style of a new medium while maintaining the cohesiveness of the original content.
- [1161] arXiv:2403.18564 (replaced) [pdf, html, other]
-
Title: Reachability Analysis Using Constrained Polynomial Logical ZonotopesComments: IEEE Control Systems Letters (2024)Subjects: Systems and Control (eess.SY); Logic in Computer Science (cs.LO)
In this paper, we propose reachability analysis using constrained polynomial logical zonotopes. We perform reachability analysis to compute the set of states that could be reached. To do this, we utilize a recently introduced set representation called polynomial logical zonotopes for performing computationally efficient and exact reachability analysis on logical systems. Notably, polynomial logical zonotopes address the "curse of dimensionality" when analyzing the reachability of logical systems since the set representation can represent $2^h$ binary vectors using $h$ generators. After finishing the reachability analysis, the formal verification involves verifying whether the intersection of the calculated reachable set and the unsafe set is empty or not. Polynomial logical zonotopes lack closure under intersections, prompting the formulation of constrained polynomial logical zonotopes, which preserve the computational efficiency and exactness of polynomial logical zonotopes for reachability analysis while enabling exact intersections. Additionally, an extensive empirical study is presented to demonstrate and validate the advantages of constrained polynomial logical zonotopes.
- [1162] arXiv:2403.19050 (replaced) [pdf, html, other]
-
Title: Detecting Generative Parroting through Overfitting Masked AutoencodersComments: Accepted to CVPR 2024, Responsible Generative AI workshopSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
The advent of generative AI models has revolutionized digital content creation, yet it introduces challenges in maintaining copyright integrity due to generative parroting, where models mimic their training data too closely. Our research presents a novel approach to tackle this issue by employing an overfitted Masked Autoencoder (MAE) to detect such parroted samples effectively. We establish a detection threshold based on the mean loss across the training dataset, allowing for the precise identification of parroted content in modified datasets. Preliminary evaluations demonstrate promising results, suggesting our method's potential to ensure ethical use and enhance the legal compliance of generative models.
- [1163] arXiv:2403.19888 (replaced) [pdf, html, other]
-
Title: MambaMixer: Efficient Selective State Space Models with Dual Token and Channel SelectionSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Recent advances in deep learning have mainly relied on Transformers due to their data dependency and ability to learn at scale. The attention module in these architectures, however, exhibits quadratic time and space in input size, limiting their scalability for long-sequence modeling. State Space Models (SSMs), and more specifically Selective SSMs (S6), with efficient hardware-aware implementation, have shown promising potential for long causal sequence modeling. They, however, use separate blocks for each channel and fail to filter irrelevant channels and capture inter-channel dependencies. Natural attempt to mix information across channels using MLP, attention, or SSMs results in further instability in the training of SSMs for large networks and/or nearly double the number of parameters. We present the MambaMixer block, a new SSM-based architecture with data-dependent weights that uses a dual selection mechanism across tokens and channels-called Selective Token and Channel Mixer. To mitigate doubling the number of parameters, we present a new non-causal heuristic of the S6 block with a hardware-friendly implementation. We further present an efficient variant of MambaMixer, called QSMixer, that mixes information along both sequence and embedding dimensions. As a proof of concept, we design Vision MambaMixer (ViM2) and Vision QSMixer (ViQS) architectures. To enhance their ability to capture spatial information in images, we present Switch of Scans (SoS) that dynamically uses a set of useful image scans to traverse image patches. We evaluate the performance of our methods in image classification, segmentation, and object detection. Our results underline the importance of selectively mixing across both tokens and channels and show the competitive (resp. superior) performance of our methods with well-established vision models (resp. SSM-based models).
- [1164] arXiv:2403.20150 (replaced) [pdf, html, other]
-
Title: TFB: Towards Comprehensive and Fair Benchmarking of Time Series Forecasting MethodsXiangfei Qiu, Jilin Hu, Lekui Zhou, Xingjian Wu, Junyang Du, Buang Zhang, Chenjuan Guo, Aoying Zhou, Christian S. Jensen, Zhenli Sheng, Bin YangComments: Directly accepted by PVLDB 2024Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Time series are generated in diverse domains such as economic, traffic, health, and energy, where forecasting of future values has numerous important applications. Not surprisingly, many forecasting methods are being proposed. To ensure progress, it is essential to be able to study and compare such methods empirically in a comprehensive and reliable manner. To achieve this, we propose TFB, an automated benchmark for Time Series Forecasting (TSF) methods. TFB advances the state-of-the-art by addressing shortcomings related to datasets, comparison methods, and evaluation pipelines: 1) insufficient coverage of data domains, 2) stereotype bias against traditional methods, and 3) inconsistent and inflexible pipelines. To achieve better domain coverage, we include datasets from 10 different domains: traffic, electricity, energy, the environment, nature, economic, stock markets, banking, health, and the web. We also provide a time series characterization to ensure that the selected datasets are comprehensive. To remove biases against some methods, we include a diverse range of methods, including statistical learning, machine learning, and deep learning methods, and we also support a variety of evaluation strategies and metrics to ensure a more comprehensive evaluations of different methods. To support the integration of different methods into the benchmark and enable fair comparisons, TFB features a flexible and scalable pipeline that eliminates biases. Next, we employ TFB to perform a thorough evaluation of 21 Univariate Time Series Forecasting (UTSF) methods on 8,068 univariate time series and 14 Multivariate Time Series Forecasting (MTSF) methods on 25 datasets. The benchmark code and data are available at this https URL.
- [1165] arXiv:2403.20253 (replaced) [pdf, html, other]
-
Title: MedCLIP-SAM: Bridging Text and Image Towards Universal Medical Image SegmentationComments: 10 pages, 2 figuresSubjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Medical image segmentation of anatomical structures and pathology is crucial in modern clinical diagnosis, disease study, and treatment planning. To date, great progress has been made in deep learning-based segmentation techniques, but most methods still lack data efficiency, generalizability, and interactability. Consequently, the development of new, precise segmentation methods that demand fewer labeled datasets is of utmost importance in medical image analysis. Recently, the emergence of foundation models, such as CLIP and Segment-Anything-Model (SAM), with comprehensive cross-domain representation opened the door for interactive and universal image segmentation. However, exploration of these models for data-efficient medical image segmentation is still limited, but is highly necessary. In this paper, we propose a novel framework, called MedCLIP-SAM that combines CLIP and SAM models to generate segmentation of clinical scans using text prompts in both zero-shot and weakly supervised settings. To achieve this, we employed a new Decoupled Hard Negative Noise Contrastive Estimation (DHN-NCE) loss to fine-tune the BiomedCLIP model and the recent gScoreCAM to generate prompts to obtain segmentation masks from SAM in a zero-shot setting. Additionally, we explored the use of zero-shot segmentation labels in a weakly supervised paradigm to improve the segmentation quality further. By extensively testing three diverse segmentation tasks and medical image modalities (breast tumor ultrasound, brain tumor MRI, and lung X-ray), our proposed framework has demonstrated excellent accuracy. Code is available at this https URL.
- [1166] arXiv:2404.01247 (replaced) [pdf, html, other]
-
Title: An image speaks a thousand words, but can everyone listen? On image transcreation for cultural relevanceSubjects: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Given the rise of multimedia content, human translators increasingly focus on culturally adapting not only words but also other modalities such as images to convey the same meaning. While several applications stand to benefit from this, machine translation systems remain confined to dealing with language in speech and text. In this work, we take a first step towards translating images to make them culturally relevant. First, we build three pipelines comprising state-of-the-art generative models to do the task. Next, we build a two-part evaluation dataset: i) concept: comprising 600 images that are cross-culturally coherent, focusing on a single concept per image, and ii) application: comprising 100 images curated from real-world applications. We conduct a multi-faceted human evaluation of translated images to assess for cultural relevance and meaning preservation. We find that as of today, image-editing models fail at this task, but can be improved by leveraging LLMs and retrievers in the loop. Best pipelines can only translate 5% of images for some countries in the easier concept dataset and no translation is successful for some countries in the application dataset, highlighting the challenging nature of the task. Our code and data is released here: this https URL.
- [1167] arXiv:2404.03147 (replaced) [pdf, html, other]
-
Title: Eigenpruning: an Interpretability-Inspired PEFT MethodComments: Extended abstract accepted to LatinX at NAACL 2024Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
We introduce eigenpruning, a method that removes singular values from weight matrices in an LLM to improve its performance in a particular task. This method is inspired by interpretability methods designed to automatically find subnetworks of a model which solve a specific task. In our tests, the pruned model outperforms the original model by a large margin, while only requiring minimal computation to prune the weight matrices. In the case of a small synthetic task in integer multiplication, the Phi-2 model can improve its accuracy in the test set from 13.75% to 97.50%. Interestingly, these results seem to indicate the existence of a computation path that can solve the task very effectively, but it was not being used by the original model. Finally, we publicly release our implementation.
- [1168] arXiv:2404.04589 (replaced) [pdf, html, other]
-
Title: ars548_ros. An ARS 548 RDI radar driver for ROSFernando Fernández-Calatayud, Lucía Coto-Elena, David Alejo, José J. Carpio-Jiménez, Fernando Caballero, Luis MerinoComments: 19 pages, 6 figures and 22 referencesSubjects: Robotics (cs.RO)
The ARS 548 RDI Radar is a premium model of the fifth generation of 77 GHz long range radar sensors with new RF antenna arrays, which offer digital beam forming. This radar measures independently the distance, speed and angle of objects without any reflectors in one measurement cycle based on Pulse Compression with New Frequency Modulation. Unfortunately, to the best of our knowledge, there are no open source drivers available for Linux systems to enable users to analyze the data acquired by the sensor. In this paper, we present a driver that can interpret the data from the ARS 548 RDI sensor and make it available over the Robot Operating System versions 1 and 2 (ROS and ROS2). Thus, these data can be stored, represented, and analyzed using the powerful tools offered by ROS. Besides, our driver offers advanced object features provided by the sensor, such as relative estimated velocity and acceleration of each object, its orientation and angular velocity. We focus on the configuration of the sensor and the use of our driver including its filtering and representation tools. Besides, we offer a video tutorial to help in its configuration process. Finally, a dataset acquired with this sensor and an Ouster OS1-32 LiDAR sensor, to have baseline measurements, is available, so that the user can check the correctness of our driver.
- [1169] arXiv:2404.04698 (replaced) [pdf, html, other]
-
Title: CEAR: Comprehensive Event Camera Dataset for Rapid Perception of Agile Quadruped RobotsComments: 8 pages, 7 figuresSubjects: Robotics (cs.RO)
When legged robots perform agile movements, traditional RGB cameras often produce blurred images, posing a challenge for rapid perception. Event cameras have emerged as a promising solution for capturing rapid perception and coping with challenging lighting conditions thanks to their low latency, high temporal resolution, and high dynamic range. However, integrating event cameras into agile-legged robots is still largely unexplored. Notably, no dataset including event cameras has yet been developed for the context of agile quadruped robots. To bridge this gap, we introduce CEAR, a dataset comprising data from an event camera, an RGB-D camera, an IMU, a LiDAR, and joint encoders, all mounted on a dynamic quadruped, Mini Cheetah robot. This comprehensive dataset features more than 100 sequences from real-world environments, encompassing various indoor and outdoor environments, different lighting conditions, a range of robot gaits (e.g., trotting, bounding, pronking), as well as acrobatic movements like backflip. To our knowledge, this is the first event camera dataset capturing the dynamic and diverse quadruped robot motions under various setups, developed to advance research in rapid perception for quadruped robots.
- [1170] arXiv:2404.05737 (replaced) [pdf, other]
-
Title: Soil respiration signals in response to sustainable soil management practices enhance soil organic carbon stocksComments: The author was unaware that there was no legal rights to use a portion of the data used in this studySubjects: Machine Learning (cs.LG)
Development of a spatial-temporal and data-driven model of soil respiration at the global scale based on soil temperature, yearly soil moisture, and soil organic carbon (C) estimates. Prediction of soil respiration on an annual basis (1991-2018) with relatively high accuracy (NSE 0.69, CCC 0.82). Lower soil respiration trends, higher soil respiration magnitudes, and higher soil organic C stocks across areas experiencing the presence of sustainable soil management practices.
- [1171] arXiv:2404.06406 (replaced) [pdf, html, other]
-
Title: Emergent Dynamics in Neural Cellular AutomataComments: 2 pagesSubjects: Computer Vision and Pattern Recognition (cs.CV)
Neural Cellular Automata (NCA) models are trainable variations of traditional Cellular Automata (CA). Emergent motion in the patterns created by NCA has been successfully applied to synthesize dynamic textures. However, the conditions required for an NCA to display dynamic patterns remain unexplored. Here, we investigate the relationship between the NCA architecture and the emergent dynamics of the trained models. Specifically, we vary the number of channels in the cell state and the number of hidden neurons in the MultiLayer Perceptron (MLP), and draw a relationship between the combination of these two variables and the motion strength between successive frames. Our analysis reveals that the disparity and proportionality between these two variables have a strong correlation with the emergent dynamics in the NCA output. We thus propose a design principle for creating dynamic NCA.
- [1172] arXiv:2404.06421 (replaced) [pdf, html, other]
-
Title: Efficient Training of Probabilistic Neural Networks for Survival AnalysisSubjects: Machine Learning (cs.LG)
Variational Inference (VI) is a commonly used technique for approximate Bayesian inference and uncertainty estimation in deep learning models, yet it comes at a computational cost, as it doubles the number of trainable parameters to represent uncertainty. This rapidly becomes challenging in high-dimensional settings and motivates the use of alternative techniques for inference, such as Monte Carlo Dropout (MCD) or Spectral-normalized Neural Gaussian Process (SNGP). However, such methods have seen little adoption in survival analysis, and VI remains the prevalent approach for training probabilistic neural networks. In this paper, we investigate how to train deep probabilistic survival models in large datasets without introducing additional overhead in model complexity. To achieve this, we adopt three probabilistic approaches, namely VI, MCD, and SNGP, and evaluate them in terms of their prediction performance, calibration performance, and model complexity. In the context of probabilistic survival analysis, we investigate whether non-VI techniques can offer comparable or possibly improved prediction performance and uncertainty calibration compared to VI. In the MIMIC-IV dataset, we find that MCD aligns with VI in terms of the concordance index (0.748 vs. 0.743) and mean absolute error (254.9 vs. 254.7) using hinge loss, while providing C-calibrated uncertainty estimates. Moreover, our SNGP implementation provides D-calibrated survival functions in all datasets compared to VI (4/4 vs. 2/4, respectively). Our work encourages the use of techniques alternative to VI for survival analysis in high-dimensional datasets, where computational efficiency and overhead are of concern.
- [1173] arXiv:2404.06721 (replaced) [pdf, html, other]
-
Title: Poisoning Prevention in Federated Learning and Differential Privacy via Stateful Proofs of ExecutionSubjects: Cryptography and Security (cs.CR)
The rise in IoT-driven distributed data analytics, coupled with increasing privacy concerns, has led to a demand for effective privacy-preserving and federated data collection/model training mechanisms. In response, approaches such as Federated Learning (FL) and Local Differential Privacy (LDP) have been proposed and attracted much attention over the past few years. However, they still share the common limitation of being vulnerable to poisoning attacks wherein adversaries compromising edge devices feed forged (a.k.a. poisoned) data to aggregation back-ends, undermining the integrity of FL/LDP results.
In this work, we propose a system-level approach to remedy this issue based on a novel security notion of Proofs of Stateful Execution (PoSX) for IoT/embedded devices' software. To realize the PoSX concept, we design SLAPP: a System-Level Approach for Poisoning Prevention. SLAPP leverages commodity security features of embedded devices - in particular ARM TrustZoneM security extensions - to verifiably bind raw sensed data to their correct usage as part of FL/LDP edge device routines. As a consequence, it offers robust security guarantees against poisoning. Our evaluation, based on real-world prototypes featuring multiple cryptographic primitives and data collection schemes, showcases SLAPP's security and low overhead. - [1174] arXiv:2404.07112 (replaced) [pdf, html, other]
-
Title: Unfolding ADMM for Enhanced Subspace Clustering of Hyperspectral ImagesSubjects: Computer Vision and Pattern Recognition (cs.CV)
Deep subspace clustering methods are now prominent in clustering, typically using fully connected networks and a self-representation loss function. However, these methods often struggle with overfitting and lack interpretability. In this paper, we explore an alternative clustering approach based on deep unfolding. By unfolding iterative optimization methods into neural networks, this approach offers enhanced interpretability and reliability compared to data-driven deep learning methods, and greater adaptability and generalization than model-based approaches. Hence, unfolding has become widely used in inverse imaging problems, such as image restoration, reconstruction, and super-resolution, but has not been sufficiently explored yet in the context of clustering. In this work, we introduce an innovative clustering architecture for hyperspectral images (HSI) by unfolding an iterative solver based on the Alternating Direction Method of Multipliers (ADMM) for sparse subspace clustering. To our knowledge, this is the first attempt to apply unfolding ADMM for computing the self-representation matrix in subspace clustering. Moreover, our approach captures well the structural characteristics of HSI data by employing the K nearest neighbors algorithm as part of a structure preservation module. Experimental evaluation of three established HSI datasets shows clearly the potential of the unfolding approach in HSI clustering and even demonstrates superior performance compared to state-of-the-art techniques.
- [1175] arXiv:2404.08592 (replaced) [pdf, html, other]
-
Title: Scarce Resource Allocations That Rely On Machine Learning Should Be RandomizedComments: To appear in the proceedings of the International Conference on Machine Learning (ICML 2024)Subjects: Computers and Society (cs.CY)
Contrary to traditional deterministic notions of algorithmic fairness, this paper argues that fairly allocating scarce resources using machine learning often requires randomness. We address why, when, and how to randomize by proposing stochastic procedures that more adequately account for all of the claims that individuals have to allocations of social goods or opportunities.
- [1176] arXiv:2404.08676 (replaced) [pdf, html, other]
-
Title: ALERT: A Comprehensive Benchmark for Assessing Large Language Models' Safety through Red TeamingSimone Tedeschi, Felix Friedrich, Patrick Schramowski, Kristian Kersting, Roberto Navigli, Huu Nguyen, Bo LiComments: 17 pages, preprintSubjects: Computation and Language (cs.CL); Computers and Society (cs.CY); Machine Learning (cs.LG)
When building Large Language Models (LLMs), it is paramount to bear safety in mind and protect them with guardrails. Indeed, LLMs should never generate content promoting or normalizing harmful, illegal, or unethical behavior that may contribute to harm to individuals or society. This principle applies to both normal and adversarial use. In response, we introduce ALERT, a large-scale benchmark to assess safety based on a novel fine-grained risk taxonomy. It is designed to evaluate the safety of LLMs through red teaming methodologies and consists of more than 45k instructions categorized using our novel taxonomy. By subjecting LLMs to adversarial testing scenarios, ALERT aims to identify vulnerabilities, inform improvements, and enhance the overall safety of the language models. Furthermore, the fine-grained taxonomy enables researchers to perform an in-depth evaluation that also helps one to assess the alignment with various policies. In our experiments, we extensively evaluate 10 popular open- and closed-source LLMs and demonstrate that many of them still struggle to attain reasonable levels of safety.
- [1177] arXiv:2404.09387 (replaced) [pdf, html, other]
-
Title: RankCLIP: Ranking-Consistent Language-Image PretrainingComments: 12 pages, 4 figures, 6 tables. Code and model checkpoints are available at this https URLSubjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Self-supervised contrastive learning models, such as CLIP, have set new benchmarks for vision-language models in many downstream tasks. However, their dependency on rigid one-to-one mappings overlooks the complex and often multifaceted relationships between and within texts and images. To this end, we introduce RANKCLIP, a novel pretraining method that extends beyond the rigid one-to-one matching framework of CLIP and its variants. By extending the traditional pair-wise loss to list-wise, and leveraging both in-modal and cross-modal ranking consistency, RANKCLIP improves the alignment process, enabling it to capture the nuanced many-to-many relationships between and within each modality. Through comprehensive experiments, we demonstrate the effectiveness of RANKCLIP in various downstream tasks, notably achieving significant gains in zero-shot classifications over state-of-the-art methods, underscoring the importance of this enhanced learning process.
- [1178] arXiv:2404.10508 (replaced) [pdf, html, other]
-
Title: White Men Lead, Black Women Help? Benchmarking Language Agency Social Biases in LLMsSubjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Language agency is an important aspect of evaluating social biases in texts. While several studies approached agency-related bias in human-written language, very limited research has investigated such biases in Large Language Model (LLM)-generated content. In addition, previous research often relies on string-matching techniques to identify agentic and communal words within texts, which fall short of accurately classifying language agency. We introduce the novel Language Agency Bias Evaluation (LABE) benchmark, which comprehensively evaluates biases in LLMs by analyzing agency levels attributed to different demographic groups in model generations. LABE leverages 5,400 template-based prompts, an accurate agency classifier, and corresponding bias metrics to test for gender, racial, and intersectional language agency biases in LLMs on 3 text generation tasks: biographies, professor reviews, and reference letters. To build better and more accurate automated agency classifiers, we also contribute and release the Language Agency Classification (LAC) dataset, consisting of 3,724 agentic and communal sentences. Using LABE, we unveil previously under-explored language agency social biases in 3 recent LLMs: ChatGPT, Llama3, and Mistral. We observe that: (1) For the same text category, LLM generations demonstrate higher levels of gender bias than human-written texts; (2) On most generation tasks, models show remarkably higher levels of intersectional bias than the other bias aspects. Those who are at the intersection of gender and racial minority groups -- such as Black females -- are consistently described by texts with lower levels of agency; (3) Among the 3 LLMs investigated, Llama3 demonstrates greatest overall bias in language agency; (4) Not only does prompt-based mitigation fail to resolve language agency bias in LLMs, but it frequently leads to the exacerbation of biases in generated texts.
- [1179] arXiv:2404.10595 (replaced) [pdf, html, other]
-
Title: Automated Evaluation of Large Vision-Language Models on Self-driving Corner CasesKai Chen, Yanze Li, Wenhua Zhang, Yanxin Liu, Pengxiang Li, Ruiyuan Gao, Lanqing Hong, Meng Tian, Xinhai Zhao, Zhenguo Li, Dit-Yan Yeung, Huchuan Lu, Xu JiaComments: Project Page: this https URLSubjects: Computer Vision and Pattern Recognition (cs.CV)
Large Vision-Language Models (LVLMs) have received widespread attention in advancing the interpretable self-driving. Existing evaluations of LVLMs primarily focus on the multi-faceted capabilities in natural circumstances, lacking automated and quantifiable assessment for self-driving, let alone the severe road corner cases. In this paper, we propose CODA-LM, the very first benchmark for the automatic evaluation of LVLMs for self-driving corner cases. We adopt a hierarchical data structure to prompt powerful LVLMs to analyze complex driving scenes and generate high-quality pre-annotation for human annotators, and for LVLM evaluation, we show that using the text-only large language models (LLMs) as judges reveals even better alignment with human preferences than the LVLM judges. Moreover, with CODA-LM, we build CODA-VLM, a new driving LVLM surpassing all the open-sourced counterparts on CODA-LM. Our CODA-VLM performs comparably with GPT-4V, even surpassing GPT-4V by +21.42% on the regional perception task. We hope CODA-LM can become the catalyst to promote interpretable self-driving empowered by LVLMs.
- [1180] arXiv:2404.11735 (replaced) [pdf, html, other]
-
Title: Learning with 3D rotations, a hitchhiker's guide to SO(3)Comments: Published at ICML 2024Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Many settings in machine learning require the selection of a rotation representation. However, choosing a suitable representation from the many available options is challenging. This paper acts as a survey and guide through rotation representations. We walk through their properties that harm or benefit deep learning with gradient-based optimization. By consolidating insights from rotation-based learning, we provide a comprehensive overview of learning functions with rotation representations. We provide guidance on selecting representations based on whether rotations are in the model's input or output and whether the data primarily comprises small angles.
- [1181] arXiv:2404.12933 (replaced) [pdf, html, other]
-
Title: Cross-cultural Inspiration Detection and Analysis in Real and LLM-generated Social Media DataSubjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Inspiration is linked to various positive outcomes, such as increased creativity, productivity, and happiness. Although inspiration has great potential, there has been limited effort toward identifying content that is inspiring, as opposed to just engaging or positive. Additionally, most research has concentrated on Western data, with little attention paid to other cultures. This work is the first to study cross-cultural inspiration through machine learning methods. We aim to identify and analyze real and AI-generated cross-cultural inspiring posts. To this end, we compile and make publicly available the InspAIred dataset, which consists of 2,000 real inspiring posts, 2,000 real non-inspiring posts, and 2,000 generated inspiring posts evenly distributed across India and the UK. The real posts are sourced from Reddit, while the generated posts are created using the GPT-4 model. Using this dataset, we conduct extensive computational linguistic analyses to (1) compare inspiring content across cultures, (2) compare AI-generated inspiring posts to real inspiring posts, and (3) determine if detection models can accurately distinguish between inspiring content across cultures and data sources.
- [1182] arXiv:2404.12938 (replaced) [pdf, html, other]
-
Title: MAiDE-up: Multilingual Deception Detection of GPT-generated Hotel ReviewsSubjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Deceptive reviews are becoming increasingly common, especially given the increase in performance and the prevalence of LLMs. While work to date has addressed the development of models to differentiate between truthful and deceptive human reviews, much less is known about the distinction between real reviews and AI-authored fake reviews. Moreover, most of the research so far has focused primarily on English, with very little work dedicated to other languages. In this paper, we compile and make publicly available the MAiDE-up dataset, consisting of 10,000 real and 10,000 AI-generated fake hotel reviews, balanced across ten languages. Using this dataset, we conduct extensive linguistic analyses to (1) compare the AI fake hotel reviews to real hotel reviews, and (2) identify the factors that influence the deception detection model performance. We explore the effectiveness of several models for deception detection in hotel reviews across three main dimensions: sentiment, location, and language. We find that these dimensions influence how well we can detect AI-generated fake reviews.
- [1183] arXiv:2404.13049 (replaced) [pdf, html, other]
-
Title: DG-RePlAce: A Dataflow-Driven GPU-Accelerated Analytical Global Placement Framework for Machine Learning AcceleratorsSubjects: Hardware Architecture (cs.AR); Machine Learning (cs.LG)
Global placement is a fundamental step in VLSI physical design. The wide use of 2D processing element (PE) arrays in machine learning accelerators poses new challenges of scalability and Quality of Results (QoR) for state-of-the-art academic global placers. In this work, we develop DG-RePlAce, a new and fast GPU-accelerated global placement framework built on top of the OpenROAD infrastructure, which exploits the inherent dataflow and datapath structures of machine learning accelerators. Experimental results with a variety of machine learning accelerators using a commercial 12nm enablement show that, compared with RePlAce (DREAMPlace), our approach achieves an average reduction in routed wirelength by 10% (7%) and total negative slack (TNS) by 31% (34%), with faster global placement and on-par total runtimes relative to DREAMPlace. Empirical studies on the TILOS MacroPlacement Benchmarks further demonstrate that post-route improvements over RePlAce and DREAMPlace may reach beyond the motivating application to machine learning accelerators.
- [1184] arXiv:2404.13878 (replaced) [pdf, html, other]
-
Title: Multi-Level Sequence Denoising with Cross-Signal Contrastive Learning for Sequential RecommendationSubjects: Information Retrieval (cs.IR)
Sequential recommender systems (SRSs) aim to suggest next item for a user based on her historical interaction sequences. Recently, many research efforts have been devoted to attenuate the influence of noisy items in sequences by either assigning them with lower attention weights or discarding them directly. The major limitation of these methods is that the former would still prone to overfit noisy items while the latter may overlook informative items. To the end, in this paper, we propose a novel model named Multi-level Sequence Denoising with Cross-signal Contrastive Learning (MSDCCL) for sequential recommendation. To be specific, we first introduce a target-aware user interest extractor to simultaneously capture users' long and short term interest with the guidance of target items. Then, we develop a multi-level sequence denoising module to alleviate the impact of noisy items by employing both soft and hard signal denoising strategies. Additionally, we extend existing curriculum learning by simulating the learning pattern of human beings. It is worth noting that our proposed model can be seamlessly integrated with a majority of existing recommendation models and significantly boost their effectiveness. Experimental studies on five public datasets are conducted and the results demonstrate that the proposed MSDCCL is superior to the state-of-the-art baselines. The source code is publicly available at this https URL.
- [1185] arXiv:2404.14994 (replaced) [pdf, other]
-
Title: Transformers Can Represent $n$-gram Language ModelsSubjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computational Complexity (cs.CC); Formal Languages and Automata Theory (cs.FL); Machine Learning (cs.LG)
Existing work has analyzed the representational capacity of the transformer architecture by means of formal models of computation. However, the focus so far has been on analyzing the architecture in terms of language \emph{acceptance}. We contend that this is an ill-suited problem in the study of \emph{language models} (LMs), which are definitionally \emph{probability distributions} over strings. In this paper, we focus on the relationship between transformer LMs and $n$-gram LMs, a simple and historically relevant class of language models. We show that transformer LMs using the hard or sparse attention mechanisms can exactly represent any $n$-gram LM, giving us a concrete lower bound on their probabilistic representational capacity. This provides a first step towards understanding the mechanisms that transformer LMs can use to represent probability distributions over strings.
- [1186] arXiv:2404.15720 (replaced) [pdf, html, other]
-
Title: Annotator-Centric Active Learning for Subjective NLP TasksSubjects: Computation and Language (cs.CL)
Active Learning (AL) addresses the high costs of collecting human annotations by strategically annotating the most informative samples. However, for subjective NLP tasks, incorporating a wide range of perspectives in the annotation process is crucial to capture the variability in human judgments. We introduce Annotator-Centric Active Learning (ACAL), which incorporates an annotator selection strategy following data sampling. Our objective is two-fold: (1) to efficiently approximate the full diversity of human judgments, and (2) to assess model performance using annotator-centric metrics, which emphasize minority perspectives over a majority. We experiment with multiple annotator selection strategies across seven subjective NLP tasks, employing both traditional and novel, human-centered evaluation metrics. Our findings indicate that ACAL improves data efficiency and excels in annotator-centric performance evaluations. However, its success depends on the availability of a sufficiently large and diverse pool of annotators to sample from.
- [1187] arXiv:2404.16281 (replaced) [pdf, html, other]
-
Title: Timely Communications for Remote InferenceComments: This paper appears in: IEEE/ACM Transactions on Networking. arXiv admin note: text overlap with arXiv:2208.06948Subjects: Networking and Internet Architecture (cs.NI); Information Theory (cs.IT); Machine Learning (cs.LG)
In this paper, we analyze the impact of data freshness on remote inference systems, where a pre-trained neural network blue infers a time-varying target (e.g., the locations of vehicles and pedestrians) based on features (e.g., video frames) observed at a sensing node (e.g., a camera). One might expect that the performance of a remote inference system degrades monotonically as the feature becomes stale. Using an information-theoretic analysis, we show that this is true if the feature and target data sequence can be closely approximated as a Markov chain, whereas it is not true if the data sequence is far from being Markovian. Hence, the inference error is a function of Age of Information (AoI), where the function could be non-monotonic. To minimize the inference error in real-time, we propose a new "selection-from-buffer" model for sending the features, which is more general than the "generate-at-will" model used in earlier studies. In addition, we design low-complexity scheduling policies to improve inference performance. For single-source, single-channel systems, we provide an optimal scheduling policy. In multi-source, multi-channel systems, the scheduling problem becomes a multi-action restless multi-armed bandit problem. For this setting, we design a new scheduling policy by integrating Whittle index-based source selection and duality-based feature selection-from-buffer algorithms. This new scheduling policy is proven to be asymptotically optimal. These scheduling results hold for minimizing general AoI functions (monotonic or non-monotonic). Data-driven evaluations demonstrate the significant advantages of our proposed scheduling policies.
- [1188] arXiv:2404.17293 (replaced) [pdf, html, other]
-
Title: Lazy Data Practices Harm Fairness ResearchJournal-ref: FAccT '24: Proceedings of the 2024 ACM Conference on Fairness, Accountability, and Transparency (2024) 642-659Subjects: Machine Learning (cs.LG); Computers and Society (cs.CY); Applications (stat.AP); Machine Learning (stat.ML)
Data practices shape research and practice on fairness in machine learning (fair ML). Critical data studies offer important reflections and critiques for the responsible advancement of the field by highlighting shortcomings and proposing recommendations for improvement. In this work, we present a comprehensive analysis of fair ML datasets, demonstrating how unreflective yet common practices hinder the reach and reliability of algorithmic fairness findings. We systematically study protected information encoded in tabular datasets and their usage in 280 experiments across 142 publications.
Our analyses identify three main areas of concern: (1) a \textbf{lack of representation for certain protected attributes} in both data and evaluations; (2) the widespread \textbf{exclusion of minorities} during data preprocessing; and (3) \textbf{opaque data processing} threatening the generalization of fairness research. By conducting exemplary analyses on the utilization of prominent datasets, we demonstrate how unreflective data decisions disproportionately affect minority groups, fairness metrics, and resultant model comparisons. Additionally, we identify supplementary factors such as limitations in publicly available data, privacy considerations, and a general lack of awareness, which exacerbate these challenges. To address these issues, we propose a set of recommendations for data usage in fairness research centered on transparency and responsible inclusion. This study underscores the need for a critical reevaluation of data practices in fair ML and offers directions to improve both the sourcing and usage of datasets. - [1189] arXiv:2404.17609 (replaced) [pdf, html, other]
-
Title: CoSD: Collaborative Stance Detection with Contrastive Heterogeneous Topic Graph LearningComments: 13 pagesSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Stance detection seeks to identify the viewpoints of individuals either in favor or against a given target or a controversial topic. Current advanced neural models for stance detection typically employ fully parametric softmax classifiers. However, these methods suffer from several limitations, including lack of explainability, insensitivity to the latent data structure, and unimodality, which greatly restrict their performance and applications. To address these challenges, we present a novel collaborative stance detection framework called (CoSD) which leverages contrastive heterogeneous topic graph learning to learn topic-aware semantics and collaborative signals among texts, topics, and stance labels for enhancing stance detection. During training, we construct a heterogeneous graph to structurally organize texts and stances through implicit topics via employing latent Dirichlet allocation. We then perform contrastive graph learning to learn heterogeneous node representations, aggregating informative multi-hop collaborative signals via an elaborate Collaboration Propagation Aggregation (CPA) module. During inference, we introduce a hybrid similarity scoring module to enable the comprehensive incorporation of topic-aware semantics and collaborative signals for stance detection. Extensive experiments on two benchmark datasets demonstrate the state-of-the-art detection performance of CoSD, verifying the effectiveness and explainability of our collaborative framework.
- [1190] arXiv:2404.17793 (replaced) [pdf, html, other]
-
Title: CLFT: Camera-LiDAR Fusion Transformer for Semantic Segmentation in Autonomous DrivingComments: Submitted to IEEE Transactions on Intelligent VehiclesSubjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Critical research about camera-and-LiDAR-based semantic object segmentation for autonomous driving significantly benefited from the recent development of deep learning. Specifically, the vision transformer is the novel ground-breaker that successfully brought the multi-head-attention mechanism to computer vision applications. Therefore, we propose a vision-transformer-based network to carry out camera-LiDAR fusion for semantic segmentation applied to autonomous driving. Our proposal uses the novel progressive-assemble strategy of vision transformers on a double-direction network and then integrates the results in a cross-fusion strategy over the transformer decoder layers. Unlike other works in the literature, our camera-LiDAR fusion transformers have been evaluated in challenging conditions like rain and low illumination, showing robust performance. The paper reports the segmentation results over the vehicle and human classes in different modalities: camera-only, LiDAR-only, and camera-LiDAR fusion. We perform coherent controlled benchmark experiments of CLFT against other networks that are also designed for semantic segmentation. The experiments aim to evaluate the performance of CLFT independently from two perspectives: multimodal sensor fusion and backbone architectures. The quantitative assessments show our CLFT networks yield an improvement of up to 10% for challenging dark-wet conditions when comparing with Fully-Convolutional-Neural-Network-based (FCN) camera-LiDAR fusion neural network. Contrasting to the network with transformer backbone but using single modality input, the all-around improvement is 5-10%.
- [1191] arXiv:2404.17864 (replaced) [pdf, html, other]
-
Title: Solvent: liquidity verification of smart contractsSubjects: Cryptography and Security (cs.CR); Programming Languages (cs.PL)
Smart contracts are an attractive target for attackers, as evidenced by a long history of security incidents. A current limitation of smart contract verification tools is that they are not really effective in expressing and verifying liquidity properties regarding the exchange of crypto-assets: for example, is it true that in every reachable state a user can fire a sequence of transactions to withdraw a given amount of crypto-assets? We propose Solvent, a tool aimed at verifying these kinds of properties, which are beyond the reach of existing verification tools for Solidity. We evaluate the effectiveness and performance of Solvent through a common benchmark of smart contracts.
- [1192] arXiv:2404.18083 (replaced) [pdf, html, other]
-
Title: Online,Target-Free LiDAR-Camera Extrinsic Calibration via Cross-Modal Mask MatchingComments: accepted to IEEE Trans. on Intelligent Vehicles (T-IV)Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
LiDAR-camera extrinsic calibration (LCEC) is crucial for data fusion in intelligent vehicles. Offline, target-based approaches have long been the preferred choice in this field. However, they often demonstrate poor adaptability to real-world environments. This is largely because extrinsic parameters may change significantly due to moderate shocks or during extended operations in environments with vibrations. In contrast, online, target-free approaches provide greater adaptability yet typically lack robustness, primarily due to the challenges in cross-modal feature matching. Therefore, in this article, we unleash the full potential of large vision models (LVMs), which are emerging as a significant trend in the fields of computer vision and robotics, especially for embodied artificial intelligence, to achieve robust and accurate online, target-free LCEC across a variety of challenging scenarios. Our main contributions are threefold: we introduce a novel framework known as MIAS-LCEC, provide an open-source versatile calibration toolbox with an interactive visualization interface, and publish three real-world datasets captured from various indoor and outdoor environments. The cornerstone of our framework and toolbox is the cross-modal mask matching (C3M) algorithm, developed based on a state-of-the-art (SoTA) LVM and capable of generating sufficient and reliable matches. Extensive experiments conducted on these real-world datasets demonstrate the robustness of our approach and its superior performance compared to SoTA methods, particularly for the solid-state LiDARs with super-wide fields of view.
- [1193] arXiv:2405.00303 (replaced) [pdf, html, other]
-
Title: Joint Optimization of Piecewise Linear EnsemblesComments: 7 pages, 4 figures, submitted to IEEE MLSP 2024Subjects: Machine Learning (cs.LG)
Tree ensembles achieve state-of-the-art performance on numerous prediction tasks. We propose Joint Optimization of Piecewise Linear ENsembles (JOPLEN), which jointly fits piecewise linear models at all leaf nodes of an existing tree ensemble. In addition to enhancing the expressiveness of an ensemble, JOPLEN allows several common penalties, including sparsity-promoting matrix norms and subspace-norms, to be applied to nonlinear prediction. We demonstrate the performance of JOPLEN on over 100 regression and classification datasets and with a variety of penalties. JOPLEN leads to improved prediction performance relative to not only standard random forest and gradient boosted tree ensembles, but also other methods for enhancing tree ensembles. We demonstrate that JOPLEN with a nuclear norm penalty learns subspace-aligned functions. Additionally, JOPLEN combined with a Dirty LASSO penalty is an effective feature selection method for nonlinear prediction in multitask learning.
- [1194] arXiv:2405.00495 (replaced) [pdf, html, other]
-
Title: The Loewner framework for parametric systems: Taming the curse of dimensionalityComments: 32 pages, 4 figuresSubjects: Numerical Analysis (math.NA); Systems and Control (eess.SY)
The Loewner framework is an interpolatory approach designed for approximating linear and nonlinear systems. The goal here is to extend this framework to linear parametric systems with an arbitrary number n of parameters. One main innovation established here is the construction of data-based realizations for any number of parameters. Equally importantly, we show how to alleviate the computational burden, by avoiding the explicit construction of large-scale n-dimensional Loewner matrices of size $N \times N$. This reduces the complexity from $O(N^3)$ to about $O(N^{1.4})$, thus taming the curse of dimensionality and making the solution scalable to very large data sets. To achieve this, a new generalized multivariate rational function realization is defined. Then, we introduce the n-dimensional multivariate Loewner matrices and show that they can be computed by solving a coupled set of Sylvester equations. The null space of these Loewner matrices then allows the construction of the multivariate barycentric transfer function. The principal result of this work is to show how the null space of the n-dimensional Loewner matrix can be computed using a sequence of 1-dimensional Loewner matrices, leading to a drastic computational burden reduction. Finally, we suggest two algorithms (one direct and one iterative) to construct, directly from data, multivariate (or parametric) realizations ensuring (approximate) interpolation. Numerical examples highlight the effectiveness and scalability of the method.
- [1195] arXiv:2405.00507 (replaced) [pdf, html, other]
-
Title: NeRF-Guided Unsupervised Learning of RGB-D RegistrationSubjects: Computer Vision and Pattern Recognition (cs.CV)
This paper focuses on training a robust RGB-D registration model without ground-truth pose supervision. Existing methods usually adopt a pairwise training strategy based on differentiable rendering, which enforces the photometric and the geometric consistency between the two registered frames as supervision. However, this frame-to-frame framework suffers from poor multi-view consistency due to factors such as lighting changes, geometry occlusion and reflective materials. In this paper, we present NeRF-UR, a novel frame-to-model optimization framework for unsupervised RGB-D registration. Instead of frame-to-frame consistency, we leverage the neural radiance field (NeRF) as a global model of the scene and use the consistency between the input and the NeRF-rerendered frames for pose optimization. This design can significantly improve the robustness in scenarios with poor multi-view consistency and provides better learning signal for the registration model. Furthermore, to bootstrap the NeRF optimization, we create a synthetic dataset, Sim-RGBD, through a photo-realistic simulator to warm up the registration model. By first training the registration model on Sim-RGBD and later unsupervisedly fine-tuning on real data, our framework enables distilling the capability of feature extraction and registration from simulation to reality. Our method outperforms the state-of-the-art counterparts on two popular indoor RGB-D datasets, ScanNet and 3DMatch. Code and models will be released for paper reproduction.
- [1196] arXiv:2405.00539 (replaced) [pdf, html, other]
-
Title: Data-driven approximation of Koopman operators and generators: Convergence rates and error boundsSubjects: Numerical Analysis (math.NA); Dynamical Systems (math.DS)
Global information about dynamical systems can be extracted by analysing associated infinite-dimensional transfer operators, such as Perron-Frobenius and Koopman operators as well as their infinitesimal generators. In practice, these operators typically need to be approximated from data. Popular approximation methods are extended dynamic mode decomposition (EDMD) and generator extended mode decomposition (gEDMD). We propose a unified framework that leverages Monte Carlo sampling to approximate the operator of interest on a finite-dimensional space spanned by a set of basis functions. Our framework contains EDMD and gEDMD as special cases, but can also be used to approximate more general operators. Our key contributions are proofs of the convergence of the approximating operator and its spectrum under non-restrictive conditions. Moreover, we derive explicit convergence rates and account for the presence of noise in the observations. Whilst all these results are broadly applicable, they also refine previous analyses of EDMD and gEDMD. We verify the analytical results with the aid of several numerical experiments.
- [1197] arXiv:2405.00860 (replaced) [pdf, other]
-
Title: Public Computing Intellectuals in the Age of AI CrisisComments: 28 pages, 2 tablesSubjects: Computers and Society (cs.CY)
The belief that AI technology is on the cusp of causing a generalized social crisis became a popular one in 2023. While there was no doubt an element of hype and exaggeration to some of these accounts, they do reflect the fact that there are troubling ramifications to this technology stack. This conjunction of shared concerns about social, political, and personal futures presaged by current developments in artificial intelligence presents the academic discipline of computing with a renewed opportunity for self-examination and reconfiguration. This position paper endeavors to do so in four sections. The first explores what is at stake for computing in the narrative of an AI crisis. The second articulates possible educational responses to this crisis and advocates for a broader analytic focus on power relations. The third section presents a novel characterization of academic computing's field of practice, one which includes not only the discipline's usual instrumental forms of practice but reflexive practice as well. This reflexive dimension integrates both the critical and public functions of the discipline as equal intellectual partners and a necessary component of any contemporary academic field. The final section will advocate for a conceptual archetype--the Public Computer Intellectual and its less conspicuous but still essential cousin, the (Almost) Public Computer Intellectual--as a way of practically imagining the expanded possibilities of academic practice in our discipline, one that provides both self-critique and an outward-facing orientation towards the public good. It will argue that the computer education research community can play a vital role in this regard. Recommendations for pedagogical change within computing to develop more reflexive capabilities are also provided.
- [1198] arXiv:2405.01943 (replaced) [pdf, html, other]
-
Title: Dependency-Aware Semi-Structured Sparsity: Declining Roles of Outliers in Pruning GLU-based LLMsSubjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
The rapid growth in the scale of Large Language Models (LLMs) has led to significant computational and memory costs, making model compression techniques such as network pruning increasingly crucial for their efficient deployment. Recent LLMs such as LLaMA2 and Mistral have adopted GLU-based MLP architectures. However, current LLM pruning strategies are primarily based on insights from older LLM architectures, necessitating a reevaluation of these strategies to suit the new architectural characteristics. Contrary to traditional beliefs, we find that outliers play a diminished role in the input projections of GLU-based MLPs. Leveraging this new insight, we propose Dependency-aware Semi-structured Sparsity (DaSS), a novel pruning method for GLU-based LLMs. DaSS balances the flexibility of unstructured pruning and the structural consistency of dependency-based structured pruning by considering both of weight magnitude and corresponding intermediate activation norms in weight pruning metric. Empirical evaluations on the Mistral, Gemma, and LLaMA2 model families demonstrate the consistent effectiveness of DaSS in the prevailing GLU variants.
- [1199] arXiv:2405.02612 (replaced) [pdf, html, other]
-
Title: Learning Linear Utility Functions From Pairwise Comparison QueriesComments: Submitted to ECAI for reviewSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (stat.ML)
We study learnability of linear utility functions from pairwise comparison queries. In particular, we consider two learning objectives. The first objective is to predict out-of-sample responses to pairwise comparisons, whereas the second is to approximately recover the true parameters of the utility function. We show that in the passive learning setting, linear utilities are efficiently learnable with respect to the first objective, both when query responses are uncorrupted by noise, and under Tsybakov noise when the distributions are sufficiently "nice". In contrast, we show that utility parameters are not learnable for a large set of data distributions without strong modeling assumptions, even when query responses are noise-free. Next, we proceed to analyze the learning problem in an active learning setting. In this case, we show that even the second objective is efficiently learnable, and present algorithms for both the noise-free and noisy query response settings. Our results thus exhibit a qualitative learnability gap between passive and active learning from pairwise preference queries, demonstrating the value of the ability to select pairwise queries for utility learning.
- [1200] arXiv:2405.02881 (replaced) [pdf, html, other]
-
Title: FedConPE: Efficient Federated Conversational Bandits with Heterogeneous ClientsComments: Accepted to the 33rd International Joint Conference on Artificial Intelligence (IJCAI), 2024Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Conversational recommender systems have emerged as a potent solution for efficiently eliciting user preferences. These systems interactively present queries associated with "key terms" to users and leverage user feedback to estimate user preferences more efficiently. Nonetheless, most existing algorithms adopt a centralized approach. In this paper, we introduce FedConPE, a phase elimination-based federated conversational bandit algorithm, where $M$ agents collaboratively solve a global contextual linear bandit problem with the help of a central server while ensuring secure data management. To effectively coordinate all the clients and aggregate their collected data, FedConPE uses an adaptive approach to construct key terms that minimize uncertainty across all dimensions in the feature space. Furthermore, compared with existing federated linear bandit algorithms, FedConPE offers improved computational and communication efficiency as well as enhanced privacy protections. Our theoretical analysis shows that FedConPE is minimax near-optimal in terms of cumulative regret. We also establish upper bounds for communication costs and conversation frequency. Comprehensive evaluations demonstrate that FedConPE outperforms existing conversational bandit algorithms while using fewer conversations.
- [1201] arXiv:2405.03371 (replaced) [pdf, html, other]
-
Title: Explainable Fake News Detection With Large Language Model via Defense Among Competing WisdomComments: 12 pages, WWW'2024Subjects: Computation and Language (cs.CL)
Most fake news detection methods learn latent feature representations based on neural networks, which makes them black boxes to classify a piece of news without giving any justification. Existing explainable systems generate veracity justifications from investigative journalism, which suffer from debunking delayed and low efficiency. Recent studies simply assume that the justification is equivalent to the majority opinions expressed in the wisdom of crowds. However, the opinions typically contain some inaccurate or biased information since the wisdom of crowds is uncensored. To detect fake news from a sea of diverse, crowded and even competing narratives, in this paper, we propose a novel defense-based explainable fake news detection framework. Specifically, we first propose an evidence extraction module to split the wisdom of crowds into two competing parties and respectively detect salient evidences. To gain concise insights from evidences, we then design a prompt-based module that utilizes a large language model to generate justifications by inferring reasons towards two possible veracities. Finally, we propose a defense-based inference module to determine veracity via modeling the defense among these justifications. Extensive experiments conducted on two real-world benchmarks demonstrate that our proposed method outperforms state-of-the-art baselines in terms of fake news detection and provides high-quality justifications.
- [1202] arXiv:2405.04304 (replaced) [pdf, html, other]
-
Title: Dynamic Speculation Lookahead Accelerates Speculative Decoding of Large Language ModelsJonathan Mamou, Oren Pereg, Daniel Korat, Moshe Berchansky, Nadav Timor, Moshe Wasserblat, Roy SchwartzSubjects: Computation and Language (cs.CL)
Speculative decoding is commonly used for reducing the inference latency of large language models. Its effectiveness depends highly on the speculation lookahead (SL)-the number of tokens generated by the draft model at each iteration. In this work we show that the common practice of using the same SL for all iterations static SL is suboptimal. We introduce DISCO (DynamIc SpeCulation lookahead Optimization), a novel method for dynamically selecting the SL. Our experiments with four datasets show that DISCO reaches an average speedup of 10% compared to the best static SL baseline, while generating the exact same text.
- [1203] arXiv:2405.04434 (replaced) [pdf, html, other]
-
Title: DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language ModelDeepSeek-AI, Aixin Liu, Bei Feng, Bin Wang, Bingxuan Wang, Bo Liu, Chenggang Zhao, Chengqi Dengr, Chong Ruan, Damai Dai, Daya Guo, Dejian Yang, Deli Chen, Dongjie Ji, Erhang Li, Fangyun Lin, Fuli Luo, Guangbo Hao, Guanting Chen, Guowei Li, H. Zhang, Hanwei Xu, Hao Yang, Haowei Zhang, Honghui Ding, Huajian Xin, Huazuo Gao, Hui Li, Hui Qu, J.L. Cai, Jian Liang, Jianzhong Guo, Jiaqi Ni, Jiashi Li, Jin Chen, Jingyang Yuan, Junjie Qiu, Junxiao Song, Kai Dong, Kaige Gao, Kang Guan, Lean Wang, Lecong Zhang, Lei Xu, Leyi Xia, Liang Zhao, Liyue Zhang, Meng Li, Miaojun Wang, Mingchuan Zhang, Minghua Zhang, Minghui Tang, Mingming Li, Ning Tian, Panpan Huang, Peiyi Wang, Peng Zhang, Qihao Zhu, Qinyu Chen, Qiushi Du, R.J. Chen, R.L. Jin, Ruiqi Ge, Ruizhe Pan, Runxin Xu, Ruyi Chen, S.S. Li, Shanghao Lu, Shangyan Zhou, Shanhuang Chen, Shaoqing Wu, Shengfeng Ye, Shirong Ma, Shiyu Wang, Shuang Zhou, Shuiping Yu, Shunfeng Zhou, Size Zheng, T. Wang, Tian Pei, Tian Yuan, Tianyu Sun, W.L. Xiao, Wangding Zeng, Wei An, Wen Liu, Wenfeng Liang, Wenjun Gao, Wentao Zhang, X.Q. Li, Xiangyue Jin, Xianzu Wang, Xiao Bi, Xiaodong Liu, Xiaohan Wang, Xiaojin Shen, Xiaokang Chen, Xiaosha Chen, Xiaotao Nie, Xiaowen SunSubjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
We present DeepSeek-V2, a strong Mixture-of-Experts (MoE) language model characterized by economical training and efficient inference. It comprises 236B total parameters, of which 21B are activated for each token, and supports a context length of 128K tokens. DeepSeek-V2 adopts innovative architectures including Multi-head Latent Attention (MLA) and DeepSeekMoE. MLA guarantees efficient inference through significantly compressing the Key-Value (KV) cache into a latent vector, while DeepSeekMoE enables training strong models at an economical cost through sparse computation. Compared with DeepSeek 67B, DeepSeek-V2 achieves significantly stronger performance, and meanwhile saves 42.5% of training costs, reduces the KV cache by 93.3%, and boosts the maximum generation throughput to 5.76 times. We pretrain DeepSeek-V2 on a high-quality and multi-source corpus consisting of 8.1T tokens, and further perform Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) to fully unlock its potential. Evaluation results show that, even with only 21B activated parameters, DeepSeek-V2 and its chat versions still achieve top-tier performance among open-source models.
- [1204] arXiv:2405.04795 (replaced) [pdf, html, other]
-
Title: Variational Schr\"odinger Diffusion ModelsComments: ICML 2024Subjects: Machine Learning (cs.LG)
Schrödinger bridge (SB) has emerged as the go-to method for optimizing transportation plans in diffusion models. However, SB requires estimating the intractable forward score functions, inevitably resulting in the costly implicit training loss based on simulated trajectories. To improve the scalability while preserving efficient transportation plans, we leverage variational inference to linearize the forward score functions (variational scores) of SB and restore simulation-free properties in training backward scores. We propose the variational Schrödinger diffusion model (VSDM), where the forward process is a multivariate diffusion and the variational scores are adaptively optimized for efficient transport. Theoretically, we use stochastic approximation to prove the convergence of the variational scores and show the convergence of the adaptively generated samples based on the optimal variational scores. Empirically, we test the algorithm in simulated examples and observe that VSDM is efficient in generations of anisotropic shapes and yields straighter sample trajectories compared to the single-variate diffusion. We also verify the scalability of the algorithm in real-world data and achieve competitive unconditional generation performance in CIFAR10 and conditional generation in time series modeling. Notably, VSDM no longer depends on warm-up initializations and has become tuning-friendly in training large-scale experiments.
- [1205] arXiv:2405.05097 (replaced) [pdf, html, other]
-
Title: Biology-inspired joint distribution neurons based on Hierarchical Correlation Reconstruction allowing for multidirectional neural networksComments: 6 pages, 4 figuresSubjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Popular artificial neural networks (ANN) optimize parameters for unidirectional value propagation, assuming some arbitrary parametrization type like Multi-Layer Perceptron (MLP) or Kolmogorov-Arnold Network (KAN). In contrast, for biological neurons e.g. "it is not uncommon for axonal propagation of action potentials to happen in both directions"~\cite{axon} - suggesting they are optimized to continuously operate in multidirectional way. Additionally, statistical dependencies a single neuron could model is not just (expected) value dependence, but entire joint distributions including also higher moments. Such more agnostic joint distribution neuron would allow for multidirectional propagation (of distributions or values) e.g. $\rho(x|y,z)$ or $\rho(y,z|x)$ by substituting to $\rho(x,y,z)$ and normalizing. There will be discussed Hierarchical Correlation Reconstruction (HCR) for such neuron model: assuming $\rho(x,y,z)=\sum_{ijk} a_{ijk} f_i(x) f_j(y) f_k(z)$ type parametrization of joint distribution in polynomial basis $f_i$, which allows for flexible, inexpensive processing including nonlinearities, direct model estimation and update, trained through standard backpropagation or novel ways for such structure up to tensor decomposition or information bottleneck approach. Using only pairwise (input-output) dependencies, its expected value prediction becomes KAN-like with trained activation functions as polynomials, can be extended by adding higher order dependencies through included products - in conscious interpretable way, allowing for multidirectional propagation of both values and probability densities.
- [1206] arXiv:2405.05669 (replaced) [pdf, html, other]
-
Title: Passive Obstacle Aware Control to Follow Desired VelocitiesSubjects: Robotics (cs.RO); Systems and Control (eess.SY)
Evaluating and updating the obstacle avoidance velocity for an autonomous robot in real-time ensures robustness against noise and disturbances. A passive damping controller can obtain the desired motion with a torque-controlled robot, which remains compliant and ensures a safe response to external perturbations. Here, we propose a novel approach for designing the passive control policy. Our algorithm complies with obstacle-free zones while transitioning to increased damping near obstacles to ensure collision avoidance. This approach ensures stability across diverse scenarios, effectively mitigating disturbances. Validation on a 7DoF robot arm demonstrates superior collision rejection capabilities compared to the baseline, underlining its practicality for real-world applications. Our obstacle-aware damping controller represents a substantial advancement in secure robot control within complex and uncertain environments.
- [1207] arXiv:2405.05955 (replaced) [pdf, html, other]
-
Title: Smurfs: Leveraging Multiple Proficiency Agents with Context-Efficiency for Tool PlanningSubjects: Computation and Language (cs.CL)
The emergence of large language models (LLMs) has opened up unprecedented possibilities for automating complex tasks that are often comparable to human performance. Despite their capabilities, LLMs still encounter difficulties in completing tasks that require high levels of accuracy and complexity due to their inherent limitations in handling multifaceted problems single-handedly. This paper introduces `Smurfs', a cutting-edge multi-agent framework designed to revolutionize the application of LLMs. By seamlessly transforming a conventional LLM into a synergistic multi-agent ensemble, Smurfs can enhance the model's ability to solve complex tasks at no additional cost. This is achieved through innovative prompting strategies that allocate distinct roles within the model, thereby facilitating collaboration among specialized agents and forming an intelligent multi-agent system. Our empirical investigation on both open-ended task of StableToolBench and closed-ended task on HotpotQA showcases Smurfs' superior capability in intricate tool utilization scenarios. Notably, Smurfs outmatches all the baseline methods in both experiments, setting new state-of-the-art performance. Furthermore, through comprehensive ablation studies, we dissect the contribution of the core components of the multi-agent framework to its overall efficacy. This not only verifies the effectiveness of the framework, but also sets a route for future exploration of multi-agent LLM systems.
- [1208] arXiv:2405.06258 (replaced) [pdf, html, other]
-
Title: Automatic Generation of Model and Data Cards: A Step Towards Responsible AIComments: NAACL 2024 (Oral)Subjects: Computation and Language (cs.CL)
In an era of model and data proliferation in machine learning/AI especially marked by the rapid advancement of open-sourced technologies, there arises a critical need for standardized consistent documentation. Our work addresses the information incompleteness in current human-generated model and data cards. We propose an automated generation approach using Large Language Models (LLMs). Our key contributions include the establishment of CardBench, a comprehensive dataset aggregated from over 4.8k model cards and 1.4k data cards, coupled with the development of the CardGen pipeline comprising a two-step retrieval process. Our approach exhibits enhanced completeness, objectivity, and faithfulness in generated model and data cards, a significant step in responsible AI documentation practices ensuring better accountability and traceability.
- [1209] arXiv:2405.06814 (replaced) [pdf, html, other]
-
Title: Dual-Task Vision Transformer for Rapid and Accurate Intracerebral Hemorrhage Classification on CT ImagesComments: 9 pages, 4 figure3Subjects: Computer Vision and Pattern Recognition (cs.CV)
Intracerebral hemorrhage (ICH) is a severe and sudden medical condition caused by the rupture of blood vessels in the brain, leading to permanent damage to brain tissue and often resulting in functional disabilities or death in patients. Diagnosis and analysis of ICH typically rely on brain CT imaging. Given the urgency of ICH conditions, early treatment is crucial, necessitating rapid analysis of CT images to formulate tailored treatment plans. However, the complexity of ICH CT images and the frequent scarcity of specialist radiologists pose significant challenges. Therefore, we built a dataset for ICH and normal classification and three types of ICH image classification based on the hemorrhage location, i.e., Deep, Subcortical, and Lobar. In addition, we propose a dual-task vision transformer (DTViT) for the automated classification and diagnosis of ICH images. This neural network utilizes the encoder from ViT, employing attention mechanisms for feature extraction from CT images. We incorporated two multilayer perception (MLP)-based decoders within the network to simultaneously identify the presence of ICH and classify three types of hemorrhage locations. Experimental results demonstrate that our proposed multi-classification network performs well on the built real-world test dataset. The code and dataset for this study will be made publicly available upon paper acceptance at: this https URL.
- [1210] arXiv:2405.07921 (replaced) [pdf, html, other]
-
Title: Can Better Text Semantics in Prompt Tuning Improve VLM Generalization?Hari Chandana Kuchibhotla, Sai Srinivas Kancheti, Abbavaram Gowtham Reddy, Vineeth N BalasubramanianSubjects: Computer Vision and Pattern Recognition (cs.CV)
Going beyond mere fine-tuning of vision-language models (VLMs), learnable prompt tuning has emerged as a promising, resource-efficient alternative. Despite their potential, effectively learning prompts faces the following challenges: (i) training in a low-shot scenario results in overfitting, limiting adaptability, and yielding weaker performance on newer classes or datasets; (ii) prompt-tuning's efficacy heavily relies on the label space, with decreased performance in large class spaces, signaling potential gaps in bridging image and class concepts. In this work, we investigate whether better text semantics can help address these concerns. In particular, we introduce a prompt-tuning method that leverages class descriptions obtained from Large Language Models (LLMs). These class descriptions are used to bridge image and text modalities. Our approach constructs part-level description-guided image and text features, which are subsequently aligned to learn more generalizable prompts. Our comprehensive experiments conducted across 11 benchmark datasets show that our method outperforms established methods, demonstrating substantial improvements.
- [1211] arXiv:2405.07975 (replaced) [pdf, html, other]
-
Title: Dynamic Programming for Symbolic Boolean Realizability and SynthesisComments: 32 pages including appendices and bibliography, 5 figures, paper is to be published in CAV 2024, but this version is inclusive of the AppendixSubjects: Formal Languages and Automata Theory (cs.FL); Logic in Computer Science (cs.LO)
Inspired by recent progress in dynamic programming approaches for weighted model counting, we investigate a dynamic-programming approach in the context of boolean realizability and synthesis, which takes a conjunctive-normal-form boolean formula over input and output variables, and aims at synthesizing witness functions for the output variables in terms of the inputs. We show how graded project-join trees, obtained via tree decomposition, can be used to compute a BDD representing the realizability set for the input formulas in a bottom-up order. We then show how the intermediate BDDs generated during realizability checking phase can be applied to synthesizing the witness functions in a top-down manner. An experimental evaluation of a solver -- DPSynth -- based on these ideas demonstrates that our approach for Boolean realizabilty and synthesis has superior time and space performance over a heuristics-based approach using same symbolic representations. We discuss the advantage on scalability of the new approach, and also investigate our findings on the performance of the DP framework.
- [1212] arXiv:2405.08760 (replaced) [pdf, html, other]
-
Title: Is the Pope Catholic? Yes, the Pope is Catholic. Generative Evaluation of Non-Literal Intent Resolution in LLMsSubjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Humans often express their communicative intents indirectly or non-literally, which requires their interlocutors -- human or AI -- to understand beyond the literal meaning of words. While most existing work has focused on discriminative evaluations, we present a new approach to generatively evaluate large language models' (LLMs') intention understanding by examining their responses to non-literal utterances. Ideally, an LLM should respond in line with the true intention of a non-literal utterance, not its literal interpretation. Our findings show that LLMs struggle to generate pragmatically relevant responses to non-literal language, achieving only 50-55% accuracy on average. While explicitly providing oracle intentions significantly improves performance (e.g., 75% for Mistral-Instruct), this still indicates challenges in leveraging given intentions to produce appropriate responses. Using chain-of-thought to make models spell out intentions yields much smaller gains (60% for Mistral-Instruct). These findings suggest that LLMs are not yet effective pragmatic interlocutors, highlighting the need for better approaches for modeling intentions and utilizing them for pragmatic generation.
- [1213] arXiv:2405.09215 (replaced) [pdf, html, other]
-
Title: Xmodel-VLM: A Simple Baseline for Multimodal Vision Language ModelSubjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
We introduce Xmodel-VLM, a cutting-edge multimodal vision language model. It is designed for efficient deployment on consumer GPU servers. Our work directly confronts a pivotal industry issue by grappling with the prohibitive service costs that hinder the broad adoption of large-scale multimodal systems. Through rigorous training, we have developed a 1B-scale language model from the ground up, employing the LLaVA paradigm for modal alignment. The result, which we call Xmodel-VLM, is a lightweight yet powerful multimodal vision language model. Extensive testing across numerous classic multimodal benchmarks has revealed that despite its smaller size and faster execution, Xmodel-VLM delivers performance comparable to that of larger models. Our model checkpoints and code are publicly available on GitHub at this https URL.
- [1214] arXiv:2405.10027 (replaced) [pdf, html, other]
-
Title: The Real Price of Bandit Information in Multiclass ClassificationSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
We revisit the classical problem of multiclass classification with bandit feedback (Kakade, Shalev-Shwartz and Tewari, 2008), where each input classifies to one of $K$ possible labels and feedback is restricted to whether the predicted label is correct or not. Our primary inquiry is with regard to the dependency on the number of labels $K$, and whether $T$-step regret bounds in this setting can be improved beyond the $\smash{\sqrt{KT}}$ dependence exhibited by existing algorithms. Our main contribution is in showing that the minimax regret of bandit multiclass is in fact more nuanced, and is of the form $\smash{\widetilde{\Theta}\left(\min \left\{|H| + \sqrt{T}, \sqrt{KT \log |H|} \right\} \right) }$, where $H$ is the underlying (finite) hypothesis class. In particular, we present a new bandit classification algorithm that guarantees regret $\smash{\widetilde{O}(|H|+\sqrt{T})}$, improving over classical algorithms for moderately-sized hypothesis classes, and give a matching lower bound establishing tightness of the upper bounds (up to log-factors) in all parameter regimes.
- [1215] arXiv:2405.10296 (replaced) [pdf, other]
-
Title: Verifying Unboundedness via AmalgamationComments: Erratum: Updated test for negative SUP instances in Section 4.1Subjects: Formal Languages and Automata Theory (cs.FL)
Well-structured transition systems (WSTS) are an abstract family of systems that encompasses a vast landscape of infinite-state systems. By requiring a well-quasi-ordering (wqo) on the set of states, a WSTS enables generic algorithms for classic verification tasks such as coverability and termination. However, even for systems that are WSTS like vector addition systems (VAS), the framework is notoriously ill-equipped to analyse reachability (as opposed to coverability). Moreover, some important types of infinite-state systems fall out of WSTS' scope entirely, such as pushdown systems (PDS).
Inspired by recent algorithmic techniques on VAS, we propose an abstract notion of systems where the set of runs is equipped with a wqo and supports amalgamation of runs. We show that it subsumes a large class of infinite-state systems, including (reachability languages of) VAS and PDS, and even all systems from the abstract framework of valence systems, except for those already known to be Turing-complete.
Moreover, this abstract setting enables simple and general algorithmic solutions to unboundedness problems, which have received much attention in recent years. We present algorithms for the (i) simultaneous unboundedness problem (which implies computability of downward closures and decidability of separability by piecewise testable languages), (ii) computing priority downward closures, (iii) deciding whether a language is bounded, meaning included in $w_1^*\cdots w_k^*$ for some words $w_1,\ldots,w_k$, and (iv) effective regularity of unary languages. This leads to either drastically simpler proofs or new decidability results for a rich variety of systems. - [1216] arXiv:2405.10979 (replaced) [pdf, html, other]
-
Title: Private Data Leakage in Federated Human Activity Recognition for Wearable Healthcare DevicesSubjects: Cryptography and Security (cs.CR)
Wearable data serves various health monitoring purposes, such as determining activity states based on user behavior and providing tailored exercise recommendations. However, the individual data perception and computational capabilities of wearable devices are limited, often necessitating the joint training of models across multiple devices. Federated Human Activity Recognition (HAR) presents a viable research avenue, allowing for global model training without the need to upload users' local activity data. Nonetheless, recent studies have revealed significant privacy concerns persisting within federated learning frameworks. To address this gap, we focus on investigating privacy leakage issues within federated user behavior recognition modeling across multiple wearable devices. Our proposed system entails a federated learning architecture comprising $N$ wearable device users and a parameter server, which may exhibit curiosity in extracting sensitive user information from model parameters. Consequently, we consider a membership inference attack based on a malicious server, leveraging differences in model generalization across client data. Experimentation conducted on five publicly available HAR datasets demonstrates an accuracy rate of 92\% for malicious server-based membership inference. Our study provides preliminary evidence of substantial privacy risks associated with federated training across multiple wearable devices, offering a novel research perspective within this domain.
- [1217] arXiv:2405.11032 (replaced) [pdf, html, other]
-
Title: Game-theoretic Energy Management Strategies With Interacting Agents in Formula 1Subjects: Systems and Control (eess.SY)
This paper presents an interaction-aware energy management optimization framework for Formula 1 racing. The considered scenario involves two agents and a drag reduction model. Strategic interactions between the agents are captured by a Stackelberg game formulated as a bilevel program. To address the computational challenges associated with bilevel optimization, the problem is reformulated as a single-level nonlinear program employing the Karush-Kuhn-Tucker conditions. The proposed framework contributes towards the development of new energy management and allocation strategies, caused by the presence of another agent. For instance, it provides valuable insights on how to redistribute the energy in order to optimally exploit the wake effect, showcasing a notable difference with the behavior studied in previous works. Robust energy allocations can be identified to reduce the lap time loss associated with unexpected choices of the other agent. It allows to recognize the boundary conditions for the interaction to become relevant, impacting the system's behavior, and to assess if overtaking is possible and beneficial. Overall, the framework provides a comprehensive approach for a two-agent Formula 1 racing problem with strategic interactions, offering physically intuitive and practical results.
- [1218] arXiv:2405.11357 (replaced) [pdf, html, other]
-
Title: Large Language Models Lack Understanding of Character Composition of WordsComments: To Appear at ICML 2024 Workshop on Large Language Models and CognitionSubjects: Computation and Language (cs.CL)
Large language models (LLMs) have demonstrated remarkable performances on a wide range of natural language tasks. Yet, LLMs' successes have been largely restricted to tasks concerning words, sentences, or documents, and it remains questionable how much they understand the minimal units of text, namely characters. In this paper, we examine contemporary LLMs regarding their ability to understand character composition of words, and show that most of them fail to reliably carry out even the simple tasks that can be handled by humans with perfection. We analyze their behaviors with comparison to token level performances, and discuss the potential directions for future research.
- [1219] arXiv:2405.11536 (replaced) [pdf, html, other]
-
Title: RobMOT: Robust 3D Multi-Object Tracking by Observational Noise and State Estimation Drift Mitigation on LiDAR PointCloudSubjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
This work addresses limitations in recent 3D tracking-by-detection methods, focusing on identifying legitimate trajectories and addressing state estimation drift in Kalman filters. Current methods rely heavily on threshold-based filtering of false positive detections using detection scores to prevent ghost trajectories. However, this approach is inadequate for distant and partially occluded objects, where detection scores tend to drop, potentially leading to false positives exceeding the threshold. Additionally, the literature generally treats detections as precise localizations of objects. Our research reveals that noise in detections impacts localization information, causing trajectory drift for occluded objects and hindering recovery. To this end, we propose a novel online track validity mechanism that temporally distinguishes between legitimate and ghost tracks, along with a multi-stage observational gating process for incoming observations. This mechanism significantly improves tracking performance, with a $6.28\%$ in HOTA and a $17.87\%$ increase in MOTA. We also introduce a refinement to the Kalman filter that enhances noise mitigation in trajectory drift, leading to more robust state estimation for occluded objects. Our framework, RobMOT, outperforms state-of-the-art methods, including deep learning approaches, across various detectors, achieving up to a $4\%$ margin in HOTA and $6\%$ in MOTA. RobMOT excels under challenging conditions, such as prolonged occlusions and tracking distant objects, with up to a 59\% improvement in processing latency.
- [1220] arXiv:2405.11976 (replaced) [pdf, html, other]
-
Title: Position-Guided Prompt Learning for Anomaly Detection in Chest X-RaysComments: MICCAI 2024 Early AcceptSubjects: Computer Vision and Pattern Recognition (cs.CV)
Anomaly detection in chest X-rays is a critical task. Most methods mainly model the distribution of normal images, and then regard significant deviation from normal distribution as anomaly. Recently, CLIP-based methods, pre-trained on a large number of medical images, have shown impressive performance on zero/few-shot downstream tasks. In this paper, we aim to explore the potential of CLIP-based methods for anomaly detection in chest X-rays. Considering the discrepancy between the CLIP pre-training data and the task-specific data, we propose a position-guided prompt learning method. Specifically, inspired by the fact that experts diagnose chest X-rays by carefully examining distinct lung regions, we propose learnable position-guided text and image prompts to adapt the task data to the frozen pre-trained CLIP-based model. To enhance the model's discriminative capability, we propose a novel structure-preserving anomaly synthesis method within chest x-rays during the training process. Extensive experiments on three datasets demonstrate that our proposed method outperforms some state-of-the-art methods. The code of our implementation is available at this https URL.
- [1221] arXiv:2405.12755 (replaced) [pdf, html, other]
-
Title: Progress Measures for Grokking on Real-world TasksComments: 5 pagesJournal-ref: ICML 2024 Workshop on High-dimensional Learning Dynamics (HiLD)Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Grokking, a phenomenon where machine learning models generalize long after overfitting, has been primarily observed and studied in algorithmic tasks. This paper explores grokking in real-world datasets using deep neural networks for classification under the cross-entropy loss. We challenge the prevalent hypothesis that the $L_2$ norm of weights is the primary cause of grokking by demonstrating that grokking can occur outside the expected range of weight norms. To better understand grokking, we introduce three new progress measures: activation sparsity, absolute weight entropy, and approximate local circuit complexity. These measures are conceptually related to generalization and demonstrate a stronger correlation with grokking in real-world datasets compared to weight norms. Our findings suggest that while weight norms might usually correlate with grokking and our progress measures, they are not causative, and our proposed measures provide a better understanding of the dynamics of grokking.
- [1222] arXiv:2405.13068 (replaced) [pdf, other]
-
Title: Lockpicking LLMs: A Logit-Based Jailbreak Using Token-level ManipulationSubjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Large language models (LLMs) have transformed the field of natural language processing, but they remain susceptible to jailbreaking attacks that exploit their capabilities to generate unintended and potentially harmful content. Existing token-level jailbreaking techniques, while effective, face scalability and efficiency challenges, especially as models undergo frequent updates and incorporate advanced defensive measures. In this paper, we introduce JailMine, an innovative token-level manipulation approach that addresses these limitations effectively. JailMine employs an automated "mining" process to elicit malicious responses from LLMs by strategically selecting affirmative outputs and iteratively reducing the likelihood of rejection. Through rigorous testing across multiple well-known LLMs and datasets, we demonstrate JailMine's effectiveness and efficiency, achieving a significant average reduction of 86% in time consumed while maintaining high success rates averaging 95%, even in the face of evolving defensive strategies. Our work contributes to the ongoing effort to assess and mitigate the vulnerability of LLMs to jailbreaking attacks, underscoring the importance of continued vigilance and proactive measures to enhance the security and reliability of these powerful language models.
- [1223] arXiv:2405.13179 (replaced) [pdf, html, other]
-
Title: RAG-RLRC-LaySum at BioLaySumm: Integrating Retrieval-Augmented Generation and Readability Control for Layman Summarization of Biomedical TextsYuelyu Ji, Zhuochun Li, Rui Meng, Sonish Sivarajkumar, Yanshan Wang, Zeshui Yu, Hui Ji, Yushui Han, Hanyu Zeng, Daqing HeSubjects: Computation and Language (cs.CL)
This paper introduces the RAG-RLRC-LaySum framework, designed to make complex biomedical research understandable to laymen through advanced Natural Language Processing (NLP) techniques. Our Retrieval Augmented Generation (RAG) solution, enhanced by a reranking method, utilizes multiple knowledge sources to ensure the precision and pertinence of lay summaries. Additionally, our Reinforcement Learning for Readability Control (RLRC) strategy improves readability, making scientific content comprehensible to non-specialists. Evaluations using the publicly accessible PLOS and eLife datasets show that our methods surpass Plain Gemini model, demonstrating a 20% increase in readability scores, a 15% improvement in ROUGE-2 relevance scores, and a 10% enhancement in factual accuracy. The RAG-RLRC-LaySum framework effectively democratizes scientific knowledge, enhancing public engagement with biomedical discoveries.
- [1224] arXiv:2405.13390 (replaced) [pdf, html, other]
-
Title: Convergence analysis of kernel learning FBSDE filterSubjects: Machine Learning (cs.LG); Numerical Analysis (math.NA); Mathematical Finance (q-fin.MF)
Kernel learning forward backward SDE filter is an iterative and adaptive meshfree approach to solve the nonlinear filtering problem. It builds from forward backward SDE for Fokker-Planker equation, which defines evolving density for the state variable, and employs KDE to approximate density. This algorithm has shown more superior performance than mainstream particle filter method, in both convergence speed and efficiency of solving high dimension problems.
However, this method has only been shown to converge empirically. In this paper, we present a rigorous analysis to demonstrate its local and global convergence, and provide theoretical support for its empirical results. - [1225] arXiv:2405.13929 (replaced) [pdf, html, other]
-
Title: Vikhr: The Family of Open-Source Instruction-Tuned Large Language Models for RussianSubjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
There has been a surge in the development of various Large Language Models (LLMs). However, text generation for languages other than English often faces significant challenges, including poor generation quality and the reduced computational performance due to the disproportionate representation of tokens in model's vocabulary. In this work, we address these issues and introduce Vikhr, a new state-of-the-art open-source instruction-tuned LLM designed specifically for the Russian language. Unlike previous efforts for Russian that utilize computationally inexpensive LoRA adapters on top of English-oriented models, Vikhr features an adapted tokenizer vocabulary and undergoes the continued pre-training and instruction tuning of all weights. This approach not only enhances the model's performance but also significantly improves its computational and contextual efficiency. The remarkable performance of Vikhr across various Russian-language benchmarks can also be attributed to our efforts in expanding instruction datasets and corpora for continued pre-training. Vikhr not only sets the new state of the art among open-source LLMs for Russian, but even outperforms some proprietary closed-source models on certain benchmarks. The model weights, instruction sets, and code are publicly available
- [1226] arXiv:2405.14154 (replaced) [pdf, html, other]
-
Title: Skip-SCAR: A Modular Approach to ObjectGoal Navigation with Sparsity and Adaptive SkipsComments: 13 pages, 6 figuresSubjects: Robotics (cs.RO)
In ObjectGoal navigation (ObjectNav), agents must locate specific objects within unseen environments, requiring effective observation, prediction, and navigation capabilities. This study found that traditional methods looking only for prediction accuracy often compromise on computational efficiency. To address this, we introduce "Skip-SCAR," a modular framework that enhances efficiency by leveraging sparsity and adaptive skips. The SparseConv-Augmented ResNet (SCAR) at the core of our approach uses sparse and dense feature processing in parallel, optimizing both the computation and memory footprint. Our adaptive skip technique further reduces computational demands by selectively bypassing unnecessary semantic segmentation steps based on environmental constancy. Tested on the HM3D ObjectNav datasets, Skip-SCAR not only minimizes resource use but also sets new performance benchmarks, demonstrating a robust method for improving efficiency and accuracy in robotic navigation tasks.
- [1227] arXiv:2405.15082 (replaced) [pdf, html, other]
-
Title: Advancements in Translation Accuracy for Stereo Visual-Inertial InitializationSubjects: Robotics (cs.RO)
As the current initialization method in the state-of-the-art Stereo Visual-Inertial SLAM framework, ORB-SLAM3 has limitations. Its success depends on the performance of the pure stereo SLAM system and is based on the underlying assumption that pure visual SLAM can accurately estimate the camera trajectory, which is essential for inertial parameter estimation. Meanwhile, the further improved initialization method for ORB-SLAM3, known as Stereo-NEC, is time-consuming due to applying keypoint tracking to estimate gyroscope bias with normal epipolar constraints. To address the limitations of previous methods, this paper proposes a method aimed at enhancing translation accuracy during the initialization stage. The fundamental concept of our method is to improve the translation estimate with a 3 Degree-of-Freedom (DoF) Bundle Adjustment (BA), independently, while the rotation estimate is fixed, instead of using ORB-SLAM3's 6-DoF BA. Additionally, the rotation estimate will be updated by considering IMU measurements and gyroscope bias, unlike ORB-SLAM3's rotation, which is directly obtained from stereo visual odometry and may yield inferior results when operating in challenging scenarios. We also conduct extensive evaluations on the public benchmark, the EuRoC dataset, demonstrating that our method excels in accuracy.
- [1228] arXiv:2405.15253 (replaced) [pdf, other]
-
Title: Seeing the World through an Antenna's Eye: Reception Quality Visualization Using Incomplete Technical Signal InformationComments: 5 pages, to be published in the conference proceedings of the European Signal Processing Conference (EUSIPCO) 2024, camera-ready version adding information on the space and time complexity, as well as the dataset sizeSubjects: Computer Vision and Pattern Recognition (cs.CV); Numerical Analysis (math.NA); Data Analysis, Statistics and Probability (physics.data-an)
We come up with a novel application for image analysis methods in the context of direction dependent signal characteristics. For this purpose, we describe an inpainting approach adding benefit to technical signal information which are typically only used for monitoring and control purposes in ground station operations. Recalling the theoretical properties of the employed inpainting technique and appropriate modeling allow us to demonstrate the usefulness of our approach for satellite data reception quality assessment. In our application, we show the advantages of inpainting products over raw data as well as the rich potential of the visualization of technical signal information.
- [1229] arXiv:2405.15377 (replaced) [pdf, other]
-
Title: Dynamic Planning for Sequential Whole-body Mobile ManipulationComments: technical issueSubjects: Robotics (cs.RO)
The dynamic Sequential Mobile Manipulation Planning (SMMP) framework is essential for the safe and robust operation of mobile manipulators in dynamic environments. Previous research has primarily focused on either motion-level or task-level dynamic planning, with limitations in handling state changes that have long-term effects or in generating responsive motions for diverse tasks, respectively. This paper presents a holistic dynamic planning framework that extends the Virtual Kinematic Chain (VKC)-based SMMP method, automating dynamic long-term task planning and reactive whole-body motion generation for SMMP problems. The framework consists of an online task planning module designed to respond to environment changes with long-term effects, a VKC-based whole-body motion planning module for manipulating both rigid and articulated objects, alongside a reactive Model Predictive Control (MPC) module for obstacle avoidance during execution. Simulations and real-world experiments validate the framework, demonstrating its efficacy and validity across sequential mobile manipulation tasks, even in scenarios involving human interference.
- [1230] arXiv:2405.16090 (replaced) [pdf, html, other]
-
Title: EEG-DBNet: A Dual-Branch Network for Temporal-Spectral Decoding in Motor-Imagery Brain-Computer InterfacesSubjects: Human-Computer Interaction (cs.HC); Signal Processing (eess.SP)
Motor imagery electroencephalogram (EEG)-based brain-computer interfaces (BCIs) offer significant advantages for individuals with restricted limb mobility. However, challenges such as low signal-to-noise ratio and limited spatial resolution impede accurate feature extraction from EEG signals, thereby affecting the classification accuracy of different actions. To address these challenges, this study proposes an end-to-end dual-branch network (EEG-DBNet) that decodes the temporal and spectral sequences of EEG signals in parallel through two distinct network branches. Each branch comprises a local convolutional block and a global convolutional block. The local convolutional block transforms the source signal from the temporal-spatial domain to the temporal-spectral domain. By varying the number of filters and convolution kernel sizes, the local convolutional blocks in different branches adjust the length of their respective dimension sequences. Different types of pooling layers are then employed to emphasize the features of various dimension sequences, setting the stage for subsequent global feature extraction. The global convolution block splits and reconstructs the feature of the signal sequence processed by the local convolution block in the same branch and further extracts features through the dilated causal convolutional neural networks. Finally, the outputs from the two branches are concatenated, and signal classification is completed via a fully connected layer. Our proposed method achieves classification accuracies of 85.84% and 91.60% on the BCI Competition 4-2a and BCI Competition 4-2b datasets, respectively, surpassing existing state-of-the-art models. The source code is available at this https URL.
- [1231] arXiv:2405.16127 (replaced) [pdf, html, other]
-
Title: Finetuning Large Language Model for Personalized RankingSubjects: Information Retrieval (cs.IR)
Large Language Models (LLMs) have demonstrated remarkable performance across various domains, motivating researchers to investigate their potential use in recommendation systems. However, directly applying LLMs to recommendation tasks has proven challenging due to the significant disparity between the data used for pre-training LLMs and the specific requirements of recommendation tasks. In this study, we introduce Direct Multi-Preference Optimization (DMPO), a streamlined framework designed to bridge the gap and enhance the alignment of LLMs for recommendation tasks. DMPO enhances the performance of LLM-based recommenders by simultaneously maximizing the probability of positive samples and minimizing the probability of multiple negative samples. We conducted experimental evaluations to compare DMPO against traditional recommendation methods and other LLM-based recommendation approaches. The results demonstrate that DMPO significantly improves the recommendation capabilities of LLMs across three real-world public datasets in few-shot scenarios. Additionally, the experiments indicate that DMPO exhibits superior generalization ability in cross-domain recommendations. A case study elucidates the reasons behind these consistent improvements and also underscores DMPO's potential as an explainable recommendation system.
- [1232] arXiv:2405.16153 (replaced) [pdf, html, other]
-
Title: DefSent+: Improving sentence embeddings of language models by projecting definition sentences into a quasi-isotropic or isotropic vector space of unlimited dictionary entriesSubjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
This paper presents a significant improvement on the previous conference paper known as DefSent. The prior study seeks to improve sentence embeddings of language models by projecting definition sentences into the vector space of dictionary entries. We discover that this approach is not fully explored due to the methodological limitation of using word embeddings of language models to represent dictionary entries. This leads to two hindrances. First, dictionary entries are constrained by the single-word vocabulary, and thus cannot be fully exploited. Second, semantic representations of language models are known to be anisotropic, but pre-processing word embeddings for DefSent is not allowed because its weight is frozen during training and tied to the prediction layer. In this paper, we propose a novel method to progressively build entry embeddings not subject to the limitations. As a result, definition sentences can be projected into a quasi-isotropic or isotropic vector space of unlimited dictionary entries, so that sentence embeddings of noticeably better quality are attainable. We abbreviate our approach as DefSent+ (a plus version of DefSent), involving the following strengths: 1) the task performance on measuring sentence similarities is significantly improved compared to DefSent; 2) when DefSent+ is used to further train data-augmented models like SIMCSE, SNCSE, and SynCSE, state-of-the-art performance on measuring sentence similarities can be achieved among the approaches without using manually labeled datasets; 3) DefSent+ is also competitive in feature-based transfer for NLP downstream tasks.
- [1233] arXiv:2405.16434 (replaced) [pdf, html, other]
-
Title: The Importance of Directional Feedback for LLM-based OptimizersComments: Accepted and Presented at Foundation Models for Decision Making at NeurIPS 2023 (December 15, 2023). Work completed from June 2023 to September 2023Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Neural and Evolutionary Computing (cs.NE)
We study the potential of using large language models (LLMs) as an interactive optimizer for solving maximization problems in a text space using natural language and numerical feedback. Inspired by the classical optimization literature, we classify the natural language feedback into directional and non-directional, where the former is a generalization of the first-order feedback to the natural language space. We find that LLMs are especially capable of optimization when they are provided with {directional feedback}. Based on this insight, we design a new LLM-based optimizer that synthesizes directional feedback from the historical optimization trace to achieve reliable improvement over iterations. Empirically, we show our LLM-based optimizer is more stable and efficient in solving optimization problems, from maximizing mathematical functions to optimizing prompts for writing poems, compared with existing techniques.
- [1234] arXiv:2405.16883 (replaced) [pdf, html, other]
-
Title: Scorch: A Library for Sparse Deep LearningComments: 25 pages, 8 figuresSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Mathematical Software (cs.MS); Programming Languages (cs.PL)
The rapid growth in the size of deep learning models strains the capabilities of traditional dense computation paradigms. Leveraging sparse computation has become increasingly popular for training and deploying large-scale models, but existing deep learning frameworks lack extensive support for sparse operations. To bridge this gap, we introduce Scorch, a library that seamlessly integrates efficient sparse tensor computation into the PyTorch ecosystem, with an initial focus on inference workloads on CPUs. Scorch provides a flexible and intuitive interface for sparse tensors, supporting diverse sparse data structures. Scorch introduces a compiler stack that automates key optimizations, including automatic loop ordering, tiling, and format inference. Combined with a runtime that adapts its execution to both dense and sparse data, Scorch delivers substantial speedups over hand-written PyTorch Sparse (torch.sparse) operations without sacrificing usability. More importantly, Scorch enables efficient computation of complex sparse operations that lack hand-optimized PyTorch implementations. This flexibility is crucial for exploring novel sparse architectures. We demonstrate Scorch's ease of use and performance gains on diverse deep learning models across multiple domains. With only minimal code changes, Scorch achieves 1.05-5.78x speedups over PyTorch Sparse on end-to-end tasks. Scorch's seamless integration and performance gains make it a valuable addition to the PyTorch ecosystem. We believe Scorch will enable wider exploration of sparsity as a tool for scaling deep learning and inform the development of other sparse libraries.
- [1235] arXiv:2405.16895 (replaced) [pdf, html, other]
-
Title: Anonymization Prompt Learning for Facial Privacy-Preserving Text-to-Image GenerationComments: 15 pages, 8 figures and 5 tablesSubjects: Computer Vision and Pattern Recognition (cs.CV)
Text-to-image diffusion models, such as Stable Diffusion, generate highly realistic images from text descriptions. However, the generation of certain content at such high quality raises concerns. A prominent issue is the accurate depiction of identifiable facial images, which could lead to malicious deepfake generation and privacy violations. In this paper, we propose Anonymization Prompt Learning (APL) to address this problem. Specifically, we train a learnable prompt prefix for text-to-image diffusion models, which forces the model to generate anonymized facial identities, even when prompted to produce images of specific individuals. Extensive quantitative and qualitative experiments demonstrate the successful anonymization performance of APL, which anonymizes any specific individuals without compromising the quality of non-identity-specific image generation. Furthermore, we reveal the plug-and-play property of the learned prompt prefix, enabling its effective application across different pretrained text-to-image models for transferrable privacy and security protection against the risks of deepfakes.
- [1236] arXiv:2405.16902 (replaced) [pdf, html, other]
-
Title: Predicting from a Different Perspective: A Re-ranking Model for Inductive Knowledge Graph CompletionComments: 12 pages, 2 figuresSubjects: Machine Learning (cs.LG)
Rule-induction models have demonstrated great power in the inductive setting of knowledge graph completion. In this setting, the models are tested on a knowledge graph entirely composed of unseen entities. These models learn relation patterns as rules by utilizing subgraphs. Providing the same inputs with different rules leads to differences in the model's predictions. In this paper, we focus on the behavior of such models. We propose a re-ranking-based model called ReDistLP (Re-ranking with a Distinct Model for Link Prediction). This model enhances the effectiveness of re-ranking by leveraging the difference in the predictions between the initial retriever and the re-ranker. ReDistLP outperforms the state-of-the-art methods in 2 out of 3 benchmarks.
- [1237] arXiv:2405.17152 (replaced) [pdf, html, other]
-
Title: CoSLight: Co-optimizing Collaborator Selection and Decision-making to Enhance Traffic Signal ControlComments: Accepted by KDD 2024Subjects: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
Effective multi-intersection collaboration is pivotal for reinforcement-learning-based traffic signal control to alleviate congestion. Existing work mainly chooses neighboring intersections as collaborators. However, quite an amount of congestion, even some wide-range congestion, is caused by non-neighbors failing to collaborate. To address these issues, we propose to separate the collaborator selection as a second policy to be learned, concurrently being updated with the original signal-controlling policy. Specifically, the selection policy in real-time adaptively selects the best teammates according to phase- and intersection-level features. Empirical results on both synthetic and real-world datasets provide robust validation for the superiority of our approach, offering significant improvements over existing state-of-the-art methods. The code is available at this https URL.
- [1238] arXiv:2405.17381 (replaced) [pdf, html, other]
-
Title: Various Lengths, Constant Speed: Efficient Language Modeling with Lightning AttentionComments: Accepted by ICML 2024. Yiran Zhong is the corresponding author. Code is released at this http URLSubjects: Computation and Language (cs.CL)
We present Lightning Attention, the first linear attention implementation that maintains a constant training speed for various sequence lengths under fixed memory consumption. Due to the issue with cumulative summation operations (cumsum), previous linear attention implementations cannot achieve their theoretical advantage in a casual setting. However, this issue can be effectively solved by utilizing different attention calculation strategies to compute the different parts of attention. Specifically, we split the attention calculation into intra-blocks and inter-blocks and use conventional attention computation for intra-blocks and linear attention kernel tricks for inter-blocks. This eliminates the need for cumsum in the linear attention calculation. Furthermore, a tiling technique is adopted through both forward and backward procedures to take full advantage of the GPU hardware. To enhance accuracy while preserving efficacy, we introduce TransNormerLLM (TNL), a new architecture that is tailored to our lightning attention. We conduct rigorous testing on standard and self-collected datasets with varying model sizes and sequence lengths. TNL is notably more efficient than other language models. In addition, benchmark results indicate that TNL performs on par with state-of-the-art LLMs utilizing conventional transformer structures. The source code is released at this http URL.
- [1239] arXiv:2405.17473 (replaced) [pdf, html, other]
-
Title: Repeat-Aware Neighbor Sampling for Dynamic Graph LearningComments: Accepted by KDD 2024, Research TrackSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Social and Information Networks (cs.SI)
Dynamic graph learning equips the edges with time attributes and allows multiple links between two nodes, which is a crucial technology for understanding evolving data scenarios like traffic prediction and recommendation systems. Existing works obtain the evolving patterns mainly depending on the most recent neighbor sequences. However, we argue that whether two nodes will have interaction with each other in the future is highly correlated with the same interaction that happened in the past. Only considering the recent neighbors overlooks the phenomenon of repeat behavior and fails to accurately capture the temporal evolution of interactions. To fill this gap, this paper presents RepeatMixer, which considers evolving patterns of first and high-order repeat behavior in the neighbor sampling strategy and temporal information learning. Firstly, we define the first-order repeat-aware nodes of the source node as the destination nodes that have interacted historically and extend this concept to high orders as nodes in the destination node's high-order neighbors. Then, we extract neighbors of the source node that interacted before the appearance of repeat-aware nodes with a slide window strategy as its neighbor sequence. Next, we leverage both the first and high-order neighbor sequences of source and destination nodes to learn temporal patterns of interactions via an MLP-based encoder. Furthermore, considering the varying temporal patterns on different orders, we introduce a time-aware aggregation mechanism that adaptively aggregates the temporal representations from different orders based on the significance of their interaction time sequences. Experimental results demonstrate the superiority of RepeatMixer over state-of-the-art models in link prediction tasks, underscoring the effectiveness of the proposed repeat-aware neighbor sampling strategy.
- [1240] arXiv:2405.17665 (replaced) [pdf, html, other]
-
Title: Enhanced Robot Arm at the Edge with NLP and Vision SystemsSubjects: Robotics (cs.RO)
This paper introduces a "proof of concept" for a new approach to assistive robotics, integrating edge computing with Natural Language Processing (NLP) and computer vision to enhance the interaction between humans and robotic systems. Our "proof of concept" demonstrates the feasibility of using large language models (LLMs) and vision systems in tandem for interpreting and executing complex commands conveyed through natural language. This integration aims to improve the intuitiveness and accessibility of assistive robotic systems, making them more adaptable to the nuanced needs of users with disabilities. By leveraging the capabilities of edge computing, our system has the potential to minimize latency and support offline capability, enhancing the autonomy and responsiveness of assistive robots. Experimental results from our implementation on a robotic arm show promising outcomes in terms of accurate intent interpretation and object manipulation based on verbal commands. This research lays the groundwork for future developments in assistive robotics, focusing on creating highly responsive, user-centric systems that can significantly improve the quality of life for individuals with disabilities.
- [1241] arXiv:2405.17670 (replaced) [pdf, html, other]
-
Title: Deployment of NLP and LLM Techniques to Control Mobile Robots at the Edge: A Case Study Using GPT-4-Turbo and LLaMA 2Pascal Sikorski, Leendert Schrader, Kaleb Yu, Lucy Billadeau, Jinka Meenakshi, Naveena Mutharasan, Flavio Esposito, Hadi AliAkbarpour, Madi BabaiaslSubjects: Robotics (cs.RO)
This paper investigates the possibility of intuitive human-robot interaction through the application of Natural Language Processing (NLP) and Large Language Models (LLMs) in mobile robotics. We aim to explore the feasibility of using these technologies for edge-based deployment, where traditional cloud dependencies are eliminated. The study specifically contrasts the performance of GPT-4-Turbo, which requires cloud connectivity, with an offline-capable, quantized version of LLaMA 2 (LLaMA 2-7B.Q5 K M). Our results show that GPT-4-Turbo delivers superior performance in interpreting and executing complex commands accurately, whereas LLaMA 2 exhibits significant limitations in consistency and reliability of command execution. Communication between the control computer and the mobile robot is established via a Raspberry Pi Pico W, which wirelessly receives commands from the computer without internet dependency and transmits them through a wired connection to the robot's Arduino controller. This study highlights the potential and challenges of implementing LLMs and NLP at the edge, providing groundwork for future research into fully autonomous and network-independent robotic systems. For video demonstrations and source code, please refer to: this https URL.
- [1242] arXiv:2405.18016 (replaced) [pdf, html, other]
-
Title: On Creativity and Open-EndednessComments: 11 pages, accepted for publication in the proceedings of the 2024 International Conference for Artificial Life, Copenhagen, DenmarkSubjects: Artificial Intelligence (cs.AI)
Artificial Life (ALife) as an interdisciplinary field draws inspiration and influence from a variety of perspectives. Scientific progress crucially depends, then, on concerted efforts to invite cross-disciplinary dialogue. The goal of this paper is to revitalize discussions of potential connections between the fields of Computational Creativity (CC) and ALife, focusing specifically on the concept of Open-Endedness (OE); the primary goal of CC is to endow artificial systems with creativity, and ALife has dedicated much research effort into studying and synthesizing OE and artificial innovation. However, despite the close proximity of these concepts, their use so far remains confined to their respective communities, and their relationship is largely unclear. We provide historical context for research in both domains, and review the limited work connecting research on creativity and OE explicitly. We then highlight specific questions to be considered, with the eventual goals of (i) decreasing conceptual ambiguity by highlighting similarities and differences between the concepts of OE and creativity, (ii) identifying synergy effects of a research agenda that encompasses both concepts, and (iii) establishing a dialogue between ALife and CC research.
- [1243] arXiv:2405.18259 (replaced) [pdf, html, other]
-
Title: Ranking with Ties based on Noisy Performance DataComments: 22 pages, 21 figuresSubjects: Performance (cs.PF); Information Retrieval (cs.IR)
We consider the problem of ranking a set of objects based on their performance when the measurement of said performance is subject to noise. In this scenario, the performance is measured repeatedly, resulting in a range of measurements for each object. If the ranges of two objects do not overlap, then we consider one object as 'better' than the other, and we expect it to receive a higher rank; if, however, the ranges overlap, then the objects are incomparable, and we wish them to be assigned the same rank. Unfortunately, the incomparability relation of ranges is in general not transitive; as a consequence, in general the two requirements cannot be satisfied simultaneously, i.e., it is not possible to guarantee both distinct ranks for objects with separated ranges, and same rank for objects with overlapping ranges. This conflict leads to more than one reasonable way to rank a set of objects. In this paper, we explore the ambiguities that arise when ranking with ties, and define a set of reasonable rankings, which we call partial rankings. We develop and analyse three different methodologies to compute a partial ranking. Finally, we show how performance differences among objects can be investigated with the help of partial ranking.
- [1244] arXiv:2405.19660 (replaced) [pdf, html, other]
-
Title: PATIENT-{\Psi}: Using Large Language Models to Simulate Patients for Training Mental Health ProfessionalsRuiyi Wang, Stephanie Milani, Jamie C. Chiu, Jiayin Zhi, Shaun M. Eack, Travis Labrum, Samuel M. Murphy, Nev Jones, Kate Hardy, Hong Shen, Fei Fang, Zhiyu Zoey ChenComments: 9 pages, 5 figuresSubjects: Computation and Language (cs.CL)
Mental illness remains one of the most critical public health issues. Despite its importance, many mental health professionals highlight a disconnect between their training and actual real-world patient practice. To help bridge this gap, we propose PATIENT-{\Psi}, a novel patient simulation framework for cognitive behavior therapy (CBT) training. To build PATIENT-{\Psi}, we construct diverse patient cognitive models based on CBT principles and use large language models (LLMs) programmed with these cognitive models to act as a simulated therapy patient. We propose an interactive training scheme, PATIENT-{\Psi}-TRAINER, for mental health trainees to practice a key skill in CBT -- formulating the cognitive model of the patient -- through role-playing a therapy session with PATIENT-{\Psi}. To evaluate PATIENT-{\Psi}, we conducted a comprehensive user study of 13 mental health trainees and 20 experts. The results demonstrate that practice using PATIENT-{\Psi}-TRAINER enhances the perceived skill acquisition and confidence of the trainees beyond existing forms of training such as textbooks, videos, and role-play with non-patients. Based on the experts' perceptions, PATIENT-{\Psi} is perceived to be closer to real patient interactions than GPT-4, and PATIENT-{\Psi}-TRAINER holds strong promise to improve trainee competencies. Our code and data are released at \url{this https URL}.
- [1245] arXiv:2405.19846 (replaced) [pdf, html, other]
-
Title: Quest: Query-centric Data Synthesis Approach for Long-context Scaling of Large Language ModelSubjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Large language models, initially pre-trained with a limited context length, can better handle longer texts by continuing training on a corpus with extended contexts. However, obtaining effective long-context data is challenging due to the scarcity and uneven distribution of long documents across different domains. To address this issue, we propose a Query-centric data synthesis method, abbreviated as Quest. Quest is an interpretable method based on the observation that documents retrieved by similar queries are relevant but low-redundant, thus well-suited for synthesizing long-context data. The method is also scalable and capable of constructing large amounts of long-context data. Using Quest, we synthesize a long-context dataset up to 128k context length, significantly outperforming other data synthesis methods on multiple long-context benchmark datasets. In addition, we further verify that the Quest method is predictable through scaling law experiments, making it a reliable solution for advancing long-context models.
- [1246] arXiv:2405.20109 (replaced) [pdf, html, other]
-
Title: FMARS: Annotating Remote Sensing Images for Disaster Management using Foundation ModelsEdoardo Arnaudo, Jacopo Lungo Vaschetti, Lorenzo Innocenti, Luca Barco, Davide Lisi, Vanina Fissore, Claudio RossiComments: Accepted at IGARSS 2024, 5 pages. Revised and corrected versionSubjects: Computer Vision and Pattern Recognition (cs.CV)
Very-High Resolution (VHR) remote sensing imagery is increasingly accessible, but often lacks annotations for effective machine learning applications. Recent foundation models like GroundingDINO and Segment Anything (SAM) provide opportunities to automatically generate annotations. This study introduces FMARS (Foundation Model Annotations in Remote Sensing), a methodology leveraging VHR imagery and foundation models for fast and robust annotation. We focus on disaster management and provide a large-scale dataset with labels obtained from pre-event imagery over 19 disaster events, derived from the Maxar Open Data initiative. We train segmentation models on the generated labels, using Unsupervised Domain Adaptation (UDA) techniques to increase transferability to real-world scenarios. Our results demonstrate the effectiveness of leveraging foundation models to automatically annotate remote sensing data at scale, enabling robust downstream models for critical applications. Code and dataset are available at \url{this https URL}.
- [1247] arXiv:2405.20231 (replaced) [pdf, html, other]
-
Title: The Empirical Impact of Neural Parameter Symmetries, or Lack ThereofComments: 27 pages. Preparing code for release. v2: added / updated some citationsSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Many algorithms and observed phenomena in deep learning appear to be affected by parameter symmetries -- transformations of neural network parameters that do not change the underlying neural network function. These include linear mode connectivity, model merging, Bayesian neural network inference, metanetworks, and several other characteristics of optimization or loss-landscapes. However, theoretical analysis of the relationship between parameter space symmetries and these phenomena is difficult. In this work, we empirically investigate the impact of neural parameter symmetries by introducing new neural network architectures that have reduced parameter space symmetries. We develop two methods, with some provable guarantees, of modifying standard neural networks to reduce parameter space symmetries. With these new methods, we conduct a comprehensive experimental study consisting of multiple tasks aimed at assessing the effect of removing parameter symmetries. Our experiments reveal several interesting observations on the empirical impact of parameter symmetries; for instance, we observe linear mode connectivity between our networks without alignment of weight spaces, and we find that our networks allow for faster and more effective Bayesian neural network training.
- [1248] arXiv:2405.20947 (replaced) [pdf, html, other]
-
Title: OR-Bench: An Over-Refusal Benchmark for Large Language ModelsComments: version 2, 10 pages main, 22 pages totalSubjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Large Language Models (LLMs) require careful safety alignment to prevent malicious outputs. While significant research focuses on mitigating harmful content generation, the enhanced safety often come with the side effect of over-refusal, where LLMs may reject innocuous prompts and become less helpful. Although the issue of over-refusal has been empirically observed, a systematic measurement is challenging due to the difficulty of crafting prompts that appear harmful but are benign. This study proposes a novel method for automatically generating large-scale sets of "seemingly toxic prompts" (benign prompts likely rejected by LLMs). Leveraging this technique, we introduce OR-Bench, the first large-scale over-refusal benchmark. OR-Bench comprises 80,000 seemingly toxic prompts across 10 common rejection categories, a subset of around 1,000 hard prompts that are challenging even for state-of-the-art LLMs, and an additional 600 toxic prompts to prevent indiscriminate responses. We then conduct a comprehensive study to measure the over-refusal of 25 popular LLMs across 8 model families. Our datasets are available at this https URL and the demo can be found at this https URL. We hope this benchmark can help the community develop better safety aligned models.
- [1249] arXiv:2405.21027 (replaced) [pdf, html, other]
-
Title: Fusion-PSRO: Nash Policy Fusion for Policy Space Response OraclesComments: 20 pages, 5 figuresSubjects: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
A popular approach for solving zero-sum games is to maintain populations of policies to approximate the Nash Equilibrium (NE). Previous studies have shown that Policy Space Response Oracle (PSRO) algorithm is an effective multi-agent reinforcement learning framework for solving such games. However, repeatedly training new policies from scratch to approximate Best Response (BR) to opponents' mixed policies at each iteration is both inefficient and costly. While some PSRO variants initialize a new policy by inheriting from past BR policies, this approach limits the exploration of new policies, especially against challenging opponents. To address this issue, we propose Fusion-PSRO, which employs policy fusion to initialize policies for better approximation to BR. By selecting high-quality base policies from meta-NE, policy fusion fuses the base policies into a new policy through model averaging. This approach allows the initialized policies to incorporate multiple expert policies, making it easier to handle difficult opponents compared to inheriting from past BR policies or initializing from scratch. Moreover, our method only modifies the policy initialization phase, allowing its application to nearly all PSRO variants without additional training overhead. Our experiments on non-transitive matrix games, Leduc Poker, and the more complex Liars Dice demonstrate that Fusion-PSRO enhances the performance of nearly all PSRO variants, achieving lower exploitability.
- [1250] arXiv:2406.00014 (replaced) [pdf, html, other]
-
Title: KU-DMIS at EHRSQL 2024:Generating SQL query via question templatization in EHRHajung Kim, Chanhwi Kim, Hoonick Lee, Kyochul Jang, Jiwoo Lee, Kyungjae Lee, Gangwoo Kim, Jaewoo KangComments: Published at ClinicalNLP workshop @ NAACL 2024Subjects: Databases (cs.DB); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)
Transforming natural language questions into SQL queries is crucial for precise data retrieval from electronic health record (EHR) databases. A significant challenge in this process is detecting and rejecting unanswerable questions that request information beyond the database's scope or exceed the system's capabilities. In this paper, we introduce a novel text-to-SQL framework that robustly handles out-of-domain questions and verifies the generated queries with query execution.Our framework begins by standardizing the structure of questions into a templated format. We use a powerful large language model (LLM), fine-tuned GPT-3.5 with detailed prompts involving the table schemas of the EHR database system. Our experimental results demonstrate the effectiveness of our framework on the EHRSQL-2024 benchmark benchmark, a shared task in the ClinicalNLP workshop. Although a straightforward fine-tuning of GPT shows promising results on the development set, it struggled with the out-of-domain questions in the test set. With our framework, we improve our system's adaptability and achieve competitive performances in the official leaderboard of the EHRSQL-2024 challenge.
- [1251] arXiv:2406.00621 (replaced) [pdf, html, other]
-
Title: Log-Scale Quantization in Distributed First-Order Methods: Gradient-based Learning from Distributed DataMohammadreza Doostmohammadian, Muhammad I. Qureshi, Mohammad Hossein Khalesi, Hamid R. Rabiee, Usman A. KhanSubjects: Systems and Control (eess.SY); Signal Processing (eess.SP); Optimization and Control (math.OC)
Decentralized strategies are of interest for learning from large-scale data over networks. This paper studies learning over a network of geographically distributed nodes/agents subject to quantization. Each node possesses a private local cost function, collectively contributing to a global cost function, which the proposed methodology aims to minimize. In contrast to many existing literature, the information exchange among nodes is quantized. We adopt a first-order computationally-efficient distributed optimization algorithm (with no extra inner consensus loop) that leverages node-level gradient correction based on local data and network-level gradient aggregation only over nearby nodes. This method only requires balanced networks with no need for stochastic weight design. It can handle log-scale quantized data exchange over possibly time-varying and switching network setups. We analyze convergence over both structured networks (for example, training over data-centers) and ad-hoc multi-agent networks (for example, training over dynamic robotic networks). Through analysis and experimental validation, we show that (i) structured networks generally result in a smaller optimality gap, and (ii) logarithmic quantization leads to smaller optimality gap compared to uniform quantization.
- [1252] arXiv:2406.00799 (replaced) [pdf, html, other]
-
Title: Are you still on track!? Catching LLM Task Drift with ActivationsSubjects: Cryptography and Security (cs.CR); Computation and Language (cs.CL); Computers and Society (cs.CY)
Large Language Models (LLMs) are routinely used in retrieval-augmented applications to orchestrate tasks and process inputs from users and other sources. These inputs, even in a single LLM interaction, can come from a variety of sources, of varying trustworthiness and provenance. This opens the door to prompt injection attacks, where the LLM receives and acts upon instructions from supposedly data-only sources, thus deviating from the user's original instructions. We define this as task drift, and we propose to catch it by scanning and analyzing the LLM's activations. We compare the LLM's activations before and after processing the external input in order to detect whether this input caused instruction drift. We develop two probing methods and find that simply using a linear classifier can detect drift with near perfect ROC AUC on an out-of-distribution test set. We show that this approach generalizes surprisingly well to unseen task domains, such as prompt injections, jailbreaks, and malicious instructions, without being trained on any of these attacks. Our setup does not require any modification of the LLM (e.g., fine-tuning) or any text generation, thus maximizing deployability and cost efficiency and avoiding reliance on unreliable model output. To foster future research on activation-based task inspection, decoding, and interpretability, we will release our large-scale TaskTracker toolkit, comprising a dataset of over 500K instances, representations from 4 SoTA language models, and inspection tools.
- [1253] arXiv:2406.01013 (replaced) [pdf, html, other]
-
Title: Scalable Ensembling For Mitigating Reward OveroptimisationSubjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Reinforcement Learning from Human Feedback (RLHF) has enabled significant advancements within language modeling for powerful, instruction-following models. However, the alignment of these models remains a pressing challenge as the policy tends to overfit the learned ``proxy" reward model past an inflection point of utility as measured by a ``gold" reward model that is more performant -- a phenomenon known as overoptimisation. Prior work has mitigated this issue by computing a pessimistic statistic over an ensemble of reward models, which is common in Offline Reinforcement Learning but incredibly costly for language models with high memory requirements, making such approaches infeasible for sufficiently large models. To this end, we propose using a shared encoder but separate linear heads. We find this leads to similar performance as the full ensemble while allowing tremendous savings in memory and time required for training for models of similar size.
- [1254] arXiv:2406.01438 (replaced) [pdf, html, other]
-
Title: Asynchronous Byzantine Federated LearningSubjects: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
Federated learning (FL) enables a set of geographically distributed clients to collectively train a model through a server. Classically, the training process is synchronous, but can be made asynchronous to maintain its speed in presence of slow clients and in heterogeneous networks. The vast majority of Byzantine fault-tolerant FL systems however rely on a synchronous training process. Our solution is one of the first Byzantine-resilient and asynchronous FL algorithms that does not require an auxiliary server dataset and is not delayed by stragglers, which are shortcomings of previous works. Intuitively, the server in our solution waits to receive a minimum number of updates from clients on its latest model to safely update it, and is later able to safely leverage the updates that late clients might send. We compare the performance of our solution with state-of-the-art algorithms on both image and text datasets under gradient inversion, perturbation, and backdoor attacks. Our results indicate that our solution trains a model faster than previous synchronous FL solution, and maintains a higher accuracy, up to 1.54x and up to 1.75x for perturbation and gradient inversion attacks respectively, in the presence of Byzantine clients than previous asynchronous FL solutions.
- [1255] arXiv:2406.01439 (replaced) [pdf, html, other]
-
Title: Asynchronous Multi-Server Federated Learning for Geo-Distributed ClientsSubjects: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
Federated learning (FL) systems enable multiple clients to train a machine learning model iteratively through synchronously exchanging the intermediate model weights with a single server. The scalability of such FL systems can be limited by two factors: server idle time due to synchronous communication and the risk of a single server becoming the bottleneck. In this paper, we propose a new FL architecture, to our knowledge, the first multi-server FL system that is entirely asynchronous, and therefore addresses these two limitations simultaneously. Our solution keeps both servers and clients continuously active. As in previous multi-server methods, clients interact solely with their nearest server, ensuring efficient update integration into the model. Differently, however, servers also periodically update each other asynchronously, and never postpone interactions with clients. We compare our solution to three representative baselines - FedAvg, FedAsync and HierFAVG - on the MNIST and CIFAR-10 image classification datasets and on the WikiText-2 language modeling dataset. Our solution converges to similar or higher accuracy levels than previous baselines and requires 61% less time to do so in geo-distributed settings.
- [1256] arXiv:2406.01584 (replaced) [pdf, html, other]
-
Title: SpatialRGPT: Grounded Spatial Reasoning in Vision Language ModelComments: Project Page: this https URLSubjects: Computer Vision and Pattern Recognition (cs.CV)
Vision Language Models (VLMs) have demonstrated remarkable performance in 2D vision and language tasks. However, their ability to reason about spatial arrangements remains limited. In this work, we introduce Spatial Region GPT (SpatialRGPT) to enhance VLMs' spatial perception and reasoning capabilities. SpatialRGPT advances VLMs' spatial understanding through two key innovations: (1) a data curation pipeline that enables effective learning of regional representation from 3D scene graphs, and (2) a flexible plugin module for integrating depth information into the visual encoder of existing VLMs. During inference, when provided with user-specified region proposals, SpatialRGPT can accurately perceive their relative directions and distances. Additionally, we propose SpatialRGBT-Bench, a benchmark with ground-truth 3D annotations encompassing indoor, outdoor, and simulated environments, for evaluating 3D spatial cognition in VLMs. Our results demonstrate that SpatialRGPT significantly enhances performance in spatial reasoning tasks, both with and without local region prompts. The model also exhibits strong generalization capabilities, effectively reasoning about complex spatial relations and functioning as a region-aware dense reward annotator for robotic tasks. Code, dataset, and benchmark will be released at this https URL
- [1257] arXiv:2406.01607 (replaced) [pdf, html, other]
-
Title: Recent advances in text embedding: A Comprehensive Review of Top-Performing Methods on the MTEB BenchmarkComments: 21 pagesSubjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Text embedding methods have become increasingly popular in both industrial and academic fields due to their critical role in a variety of natural language processing tasks. The significance of universal text embeddings has been further highlighted with the rise of Large Language Models (LLMs) applications such as Retrieval-Augmented Systems (RAGs). While previous models have attempted to be general-purpose, they often struggle to generalize across tasks and domains. However, recent advancements in training data quantity, quality and diversity; synthetic data generation from LLMs as well as using LLMs as backbones encourage great improvements in pursuing universal text embeddings. In this paper, we provide an overview of the recent advances in universal text embedding models with a focus on the top performing text embeddings on Massive Text Embedding Benchmark (MTEB). Through detailed comparison and analysis, we highlight the key contributions and limitations in this area, and propose potentially inspiring future research directions.
- [1258] arXiv:2406.02548 (replaced) [pdf, html, other]
-
Title: Open-YOLO 3D: Towards Fast and Accurate Open-Vocabulary 3D Instance SegmentationMohamed El Amine Boudjoghra, Angela Dai, Jean Lahoud, Hisham Cholakkal, Rao Muhammad Anwer, Salman Khan, Fahad Shahbaz KhanSubjects: Computer Vision and Pattern Recognition (cs.CV)
Recent works on open-vocabulary 3D instance segmentation show strong promise, but at the cost of slow inference speed and high computation requirements. This high computation cost is typically due to their heavy reliance on 3D clip features, which require computationally expensive 2D foundation models like Segment Anything (SAM) and CLIP for multi-view aggregation into 3D. As a consequence, this hampers their applicability in many real-world applications that require both fast and accurate predictions. To this end, we propose a fast yet accurate open-vocabulary 3D instance segmentation approach, named Open-YOLO 3D, that effectively leverages only 2D object detection from multi-view RGB images for open-vocabulary 3D instance segmentation. We address this task by generating class-agnostic 3D masks for objects in the scene and associating them with text prompts. We observe that the projection of class-agnostic 3D point cloud instances already holds instance information; thus, using SAM might only result in redundancy that unnecessarily increases the inference time. We empirically find that a better performance of matching text prompts to 3D masks can be achieved in a faster fashion with a 2D object detector. We validate our Open-YOLO 3D on two benchmarks, ScanNet200 and Replica, under two scenarios: (i) with ground truth masks, where labels are required for given object proposals, and (ii) with class-agnostic 3D proposals generated from a 3D proposal network. Our Open-YOLO 3D achieves state-of-the-art performance on both datasets while obtaining up to $\sim$16$\times$ speedup compared to the best existing method in literature. On ScanNet200 val. set, our Open-YOLO 3D achieves mean average precision (mAP) of 24.7\% while operating at 22 seconds per scene. Code and model are available at this http URL.
- [1259] arXiv:2406.02836 (replaced) [pdf, html, other]
-
Title: DREW : Towards Robust Data Provenance by Leveraging Error-Controlled WatermarkingSubjects: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
Identifying the origin of data is crucial for data provenance, with applications including data ownership protection, media forensics, and detecting AI-generated content. A standard approach involves embedding-based retrieval techniques that match query data with entries in a reference dataset. However, this method is not robust against benign and malicious edits. To address this, we propose Data Retrieval with Error-corrected codes and Watermarking (DREW). DREW randomly clusters the reference dataset, injects unique error-controlled watermark keys into each cluster, and uses these keys at query time to identify the appropriate cluster for a given sample. After locating the relevant cluster, embedding vector similarity retrieval is performed within the cluster to find the most accurate matches. The integration of error control codes (ECC) ensures reliable cluster assignments, enabling the method to perform retrieval on the entire dataset in case the ECC algorithm cannot detect the correct cluster with high confidence. This makes DREW maintain baseline performance, while also providing opportunities for performance improvements due to the increased likelihood of correctly matching queries to their origin when performing retrieval on a smaller subset of the dataset. Depending on the watermark technique used, DREW can provide substantial improvements in retrieval accuracy (up to 40\% for some datasets and modification types) across multiple datasets and state-of-the-art embedding models (e.g., DinoV2, CLIP), making our method a promising solution for secure and reliable source identification. The code is available at this https URL
- [1260] arXiv:2406.04027 (replaced) [pdf, html, other]
-
Title: PowerPeeler: A Precise and General Dynamic Deobfuscation Method for PowerShell ScriptsComments: To appear in the ACM CCS 2024Subjects: Cryptography and Security (cs.CR); Software Engineering (cs.SE)
PowerShell is a powerful and versatile task automation tool. Unfortunately, it is also widely abused by cyber attackers. To bypass malware detection and hinder threat analysis, attackers often employ diverse techniques to obfuscate malicious PowerShell scripts. Existing deobfuscation tools suffer from the limitation of static analysis, which fails to simulate the real deobfuscation process accurately.
In this paper, we propose PowerPeeler. To the best of our knowledge, it is the first dynamic PowerShell script deobfuscation approach at the instruction level. It utilizes expression-related Abstract Syntax Tree (AST) nodes to identify potential obfuscated script pieces. Then, PowerPeeler correlates the AST nodes with their corresponding instructions and monitors the script's entire execution process. Subsequently, PowerPeeler dynamically tracks the execution of these instructions and records their execution results. Finally, PowerPeeler stringifies these results to replace the corresponding obfuscated script pieces and reconstruct the deobfuscated script.
To evaluate the effectiveness of PowerPeeler, we collect 1,736,669 real-world malicious PowerShell samples with diversity obfuscation methods. We compare PowerPeeler with five state-of-the-art deobfuscation tools and GPT-4. The evaluation results demonstrate that PowerPeeler can effectively handle all well-known obfuscation methods. Additionally, the deobfuscation correctness rate of PowerPeeler reaches 95%, significantly surpassing that of other tools. PowerPeeler not only recovers the highest amount of sensitive data but also maintains a semantic consistency over 97%, which is also the best. Moreover, PowerPeeler effectively obtains the largest quantity of valid deobfuscated results within a limited time frame. Furthermore, PowerPeeler is extendable and can be used as a helpful tool for other cyber security solutions. - [1261] arXiv:2406.04264 (replaced) [pdf, html, other]
-
Title: MLVU: A Comprehensive Benchmark for Multi-Task Long Video UnderstandingJunjie Zhou, Yan Shu, Bo Zhao, Boya Wu, Shitao Xiao, Xi Yang, Yongping Xiong, Bo Zhang, Tiejun Huang, Zheng LiuSubjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
The evaluation of Long Video Understanding (LVU) performance poses an important but challenging research problem. Despite previous efforts, the existing video understanding benchmarks are severely constrained by several issues, especially the insufficient lengths of videos, a lack of diversity in video types and evaluation tasks, and the inappropriateness for evaluating LVU performances. To address the above problems, we propose a new benchmark, called MLVU (Multi-task Long Video Understanding Benchmark), for the comprehensive and in-depth evaluation of LVU. MLVU presents the following critical values: 1) The substantial and flexible extension of video lengths, which enables the benchmark to evaluate LVU performance across a wide range of durations. 2) The inclusion of various video genres, e.g., movies, surveillance footage, egocentric videos, cartoons, game videos, etc., which reflects the models' LVU performances in different scenarios. 3) The development of diversified evaluation tasks, which enables a comprehensive examination of MLLMs' key abilities in long-video understanding. The empirical study with 20 latest MLLMs reveals significant room for improvement in today's technique, as all existing methods struggle with most of the evaluation tasks and exhibit severe performance degradation when handling longer videos. Additionally, it suggests that factors such as context length, image-understanding quality, and the choice of LLM backbone can play critical roles in future advancements. We anticipate that MLVU will advance the research of long video understanding by providing a comprehensive and in-depth analysis of MLLMs.
- [1262] arXiv:2406.04589 (replaced) [pdf, html, other]
-
Title: MUSE: Flexible Voiceprint Receptive Fields and Multi-Path Fusion Enhanced Taylor Transformer for U-Net-based Speech EnhancementComments: This paper was accepted by Interspeech 2024Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
Achieving a balance between lightweight design and high performance remains a challenging task for speech enhancement. In this paper, we introduce Multi-path Enhanced Taylor (MET) Transformer based U-net for Speech Enhancement (MUSE), a lightweight speech enhancement network built upon the Unet architecture. Our approach incorporates a novel Multi-path Enhanced Taylor (MET) Transformer block, which integrates Deformable Embedding (DE) to enable flexible receptive fields for voiceprints. The MET Transformer is uniquely designed to fuse Channel and Spatial Attention (CSA) branches, facilitating channel information exchange and addressing spatial attention deficits within the Taylor-Transformer framework. Through extensive experiments conducted on the VoiceBank+DEMAND dataset, we demonstrate that MUSE achieves competitive performance while significantly reducing both training and deployment costs, boasting a mere 0.51M parameters.
- [1263] arXiv:2406.04809 (replaced) [pdf, html, other]
-
Title: A Survey of Fragile Model WatermarkingComments: Submitted Expert Systems with ApplicationsSubjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Model fragile watermarking, inspired by both the field of adversarial attacks on neural networks and traditional multimedia fragile watermarking, has gradually emerged as a potent tool for detecting tampering, and has witnessed rapid development in recent years. Unlike robust watermarks, which are widely used for identifying model copyrights, fragile watermarks for models are designed to identify whether models have been subjected to unexpected alterations such as backdoors, poisoning, compression, among others. These alterations can pose unknown risks to model users, such as misidentifying stop signs as speed limit signs in classic autonomous driving scenarios. This paper provides an overview of the relevant work in the field of model fragile watermarking since its inception, categorizing them and revealing the developmental trajectory of the field, thus offering a comprehensive survey for future endeavors in model fragile watermarking.
- [1264] arXiv:2406.05250 (replaced) [pdf, html, other]
-
Title: LLM-Enhanced Bayesian Optimization for Efficient Analog Layout Constraint GenerationSubjects: Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR); Machine Learning (cs.LG)
Analog layout synthesis faces significant challenges due to its dependence on manual processes, considerable time requirements, and performance instability. Current Bayesian Optimization (BO)-based techniques for analog layout synthesis, despite their potential for automation, suffer from slow convergence and extensive data needs, limiting their practical application. This paper presents the \texttt{LLANA} framework, a novel approach that leverages Large Language Models (LLMs) to enhance BO by exploiting the few-shot learning abilities of LLMs for more efficient generation of analog design-dependent parameter constraints. Experimental results demonstrate that \texttt{LLANA} not only achieves performance comparable to state-of-the-art (SOTA) BO methods but also enables a more effective exploration of the analog circuit design space, thanks to LLM's superior contextual understanding and learning efficiency. The code is available at this https URL.
- [1265] arXiv:2406.05346 (replaced) [pdf, html, other]
-
Title: ProG: A Graph Prompt Learning BenchmarkSubjects: Machine Learning (cs.LG)
Artificial general intelligence on graphs has shown significant advancements across various applications, yet the traditional 'Pre-train & Fine-tune' paradigm faces inefficiencies and negative transfer issues, particularly in complex and few-shot settings. Graph prompt learning emerges as a promising alternative, leveraging lightweight prompts to manipulate data and fill the task gap by reformulating downstream tasks to the pretext. However, several critical challenges still remain: how to unify diverse graph prompt models, how to evaluate the quality of graph prompts, and to improve their usability for practical comparisons and selection. In response to these challenges, we introduce the first comprehensive benchmark for graph prompt learning. Our benchmark integrates SIX pre-training methods and FIVE state-of-the-art graph prompt techniques, evaluated across FIFTEEN diverse datasets to assess performance, flexibility, and efficiency. We also present 'ProG', an easy-to-use open-source library that streamlines the execution of various graph prompt models, facilitating objective evaluations. Additionally, we propose a unified framework that categorizes existing graph prompt methods into two main approaches: prompts as graphs and prompts as tokens. This framework enhances the applicability and comparison of graph prompt techniques. The code is available at: this https URL.
- [1266] arXiv:2406.05460 (replaced) [pdf, html, other]
-
Title: Fighting Against the Repetitive Training and Sample Dependency Problem in Few-shot Named Entity RecognitionComments: ieee access: this https URLSubjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Few-shot named entity recognition (NER) systems recognize entities using a few labeled training examples. The general pipeline consists of a span detector to identify entity spans in text and an entity-type classifier to assign types to entities. Current span detectors rely on extensive manual labeling to guide training. Almost every span detector requires initial training on basic span features followed by adaptation to task-specific features. This process leads to repetitive training of the basic span features among span detectors. Additionally, metric-based entity-type classifiers, such as prototypical networks, typically employ a specific metric that gauges the distance between the query sample and entity-type referents, ultimately assigning the most probable entity type to the query sample. However, these classifiers encounter the sample dependency problem, primarily stemming from the limited samples available for each entity-type referent. To address these challenges, we proposed an improved few-shot NER pipeline. First, we introduce a steppingstone span detector that is pre-trained on open-domain Wikipedia data. It can be used to initialize the pipeline span detector to reduce the repetitive training of basic features. Second, we leverage a large language model (LLM) to set reliable entity-type referents, eliminating reliance on few-shot samples of each type. Our model exhibits superior performance with fewer training steps and human-labeled data compared with baselines, as demonstrated through extensive experiments on various datasets. Particularly in fine-grained few-shot NER settings, our model outperforms strong baselines, including ChatGPT. We will publicly release the code, datasets, LLM outputs, and model checkpoints.
- [1267] arXiv:2406.05928 (replaced) [pdf, html, other]
-
Title: Stabler Neo-Hookean Simulation: Absolute Eigenvalue Filtering for Projected NewtonComments: SIGGRAPH 2024 (Conference track). Project page: this https URLSubjects: Graphics (cs.GR); Numerical Analysis (math.NA)
Volume-preserving hyperelastic materials are widely used to model near-incompressible materials such as rubber and soft tissues. However, the numerical simulation of volume-preserving hyperelastic materials is notoriously challenging within this regime due to the non-convexity of the energy function. In this work, we identify the pitfalls of the popular eigenvalue clamping strategy for projecting Hessian matrices to positive semi-definiteness during Newton's method. We introduce a novel eigenvalue filtering strategy for projected Newton's method to stabilize the optimization of Neo-Hookean energy and other volume-preserving variants under high Poisson's ratio (near 0.5) and large initial volume change. Our method only requires a single line of code change in the existing projected Newton framework, while achieving significant improvement in both stability and convergence speed. We demonstrate the effectiveness and efficiency of our eigenvalue projection scheme on a variety of challenging examples and over different deformations on a large dataset.
- [1268] arXiv:2406.05959 (replaced) [pdf, html, other]
-
Title: MAGNOLIA: Matching Algorithms via GNNs for Online Value-to-go ApproximationComments: Accepted to ICML 2024; Hardness result from introduction made preciseSubjects: Machine Learning (cs.LG); Data Structures and Algorithms (cs.DS)
Online Bayesian bipartite matching is a central problem in digital marketplaces and exchanges, including advertising, crowdsourcing, ridesharing, and kidney exchange. We introduce a graph neural network (GNN) approach that emulates the problem's combinatorially-complex optimal online algorithm, which selects actions (e.g., which nodes to match) by computing each action's value-to-go (VTG) -- the expected weight of the final matching if the algorithm takes that action, then acts optimally in the future. We train a GNN to estimate VTG and show empirically that this GNN returns high-weight matchings across a variety of tasks. Moreover, we identify a common family of graph distributions in spatial crowdsourcing applications, such as rideshare, under which VTG can be efficiently approximated by aggregating information within local neighborhoods in the graphs. This structure matches the local behavior of GNNs, providing theoretical justification for our approach.
- [1269] arXiv:2406.06366 (replaced) [pdf, html, other]
-
Title: Symmetric Dot-Product Attention for Efficient Training of BERT Language ModelsComments: to be published in Findings of the Association for Computational Linguistics: ACL 2024Subjects: Computation and Language (cs.CL)
Initially introduced as a machine translation model, the Transformer architecture has now become the foundation for modern deep learning architecture, with applications in a wide range of fields, from computer vision to natural language processing. Nowadays, to tackle increasingly more complex tasks, Transformer-based models are stretched to enormous sizes, requiring increasingly larger training datasets, and unsustainable amount of compute resources. The ubiquitous nature of the Transformer and its core component, the attention mechanism, are thus prime targets for efficiency research. In this work, we propose an alternative compatibility function for the self-attention mechanism introduced by the Transformer architecture. This compatibility function exploits an overlap in the learned representation of the traditional scaled dot-product attention, leading to a symmetric with pairwise coefficient dot-product attention. When applied to the pre-training of BERT-like models, this new symmetric attention mechanism reaches a score of 79.36 on the GLUE benchmark against 78.74 for the traditional implementation, leads to a reduction of 6% in the number of trainable parameters, and reduces the number of training steps required before convergence by half.
- [1270] arXiv:2406.06367 (replaced) [pdf, html, other]
-
Title: MVGamba: Unify 3D Content Generation as State Space Sequence ModelingXuanyu Yi, Zike Wu, Qiuhong Shen, Qingshan Xu, Pan Zhou, Joo-Hwee Lim, Shuicheng Yan, Xinchao Wang, Hanwang ZhangSubjects: Computer Vision and Pattern Recognition (cs.CV)
Recent 3D large reconstruction models (LRMs) can generate high-quality 3D content in sub-seconds by integrating multi-view diffusion models with scalable multi-view reconstructors. Current works further leverage 3D Gaussian Splatting as 3D representation for improved visual quality and rendering efficiency. However, we observe that existing Gaussian reconstruction models often suffer from multi-view inconsistency and blurred textures. We attribute this to the compromise of multi-view information propagation in favor of adopting powerful yet computationally intensive architectures (e.g., Transformers). To address this issue, we introduce MVGamba, a general and lightweight Gaussian reconstruction model featuring a multi-view Gaussian reconstructor based on the RNN-like State Space Model (SSM). Our Gaussian reconstructor propagates causal context containing multi-view information for cross-view self-refinement while generating a long sequence of Gaussians for fine-detail modeling with linear complexity. With off-the-shelf multi-view diffusion models integrated, MVGamba unifies 3D generation tasks from a single image, sparse images, or text prompts. Extensive experiments demonstrate that MVGamba outperforms state-of-the-art baselines in all 3D content generation scenarios with approximately only $0.1\times$ of the model size.
- [1271] arXiv:2406.06385 (replaced) [pdf, html, other]
-
Title: Low-Rank Quantization-Aware Training for LLMsSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Large language models (LLMs) are omnipresent, however their practical deployment is challenging due to their ever increasing computational and memory demands. Quantization is one of the most effective ways to make them more compute and memory efficient. Quantization-aware training (QAT) methods, generally produce the best quantized performance, however it comes at the cost of potentially long training time and excessive memory usage, making it impractical when applying for LLMs. Inspired by parameter-efficient fine-tuning (PEFT) and low-rank adaptation (LoRA) literature, we propose LR-QAT -- a lightweight and memory-efficient QAT algorithm for LLMs. LR-QAT employs several components to save memory without sacrificing predictive performance: (a) low-rank auxiliary weights that are aware of the quantization grid; (b) a downcasting operator using fixed-point or double-packed integers and (c) checkpointing. Unlike most related work, our method (i) is inference-efficient, leading to no additional overhead compared to traditional PTQ; (ii) can be seen as a general extended pretraining framework, meaning that the resulting model can still be utilized for any downstream task afterwards; (iii) can be applied across a wide range of quantization settings, such as different choices quantization granularity, activation quantization, and seamlessly combined with many PTQ techniques. We apply LR-QAT to LLaMA-2/3 and Mistral model families and validate its effectiveness on several downstream tasks. Our method outperforms common post-training quantization (PTQ) approaches and reaches the same model performance as full-model QAT at the fraction of its memory usage. Specifically, we can train a 7B LLM on a single consumer grade GPU with 24GB of memory.
- [1272] arXiv:2406.06393 (replaced) [pdf, html, other]
-
Title: STimage-1K4M: A histopathology image-gene expression dataset for spatial transcriptomicsSubjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Genomics (q-bio.GN)
Recent advances in multi-modal algorithms have driven and been driven by the increasing availability of large image-text datasets, leading to significant strides in various fields, including computational pathology. However, in most existing medical image-text datasets, the text typically provides high-level summaries that may not sufficiently describe sub-tile regions within a large pathology image. For example, an image might cover an extensive tissue area containing cancerous and healthy regions, but the accompanying text might only specify that this image is a cancer slide, lacking the nuanced details needed for in-depth analysis. In this study, we introduce STimage-1K4M, a novel dataset designed to bridge this gap by providing genomic features for sub-tile images. STimage-1K4M contains 1,149 images derived from spatial transcriptomics data, which captures gene expression information at the level of individual spatial spots within a pathology image. Specifically, each image in the dataset is broken down into smaller sub-image tiles, with each tile paired with 15,000-30,000 dimensional gene expressions. With 4,293,195 pairs of sub-tile images and gene expressions, STimage-1K4M offers unprecedented granularity, paving the way for a wide range of advanced research in multi-modal data analysis an innovative applications in computational pathology, and beyond.
- [1273] arXiv:2406.06479 (replaced) [pdf, html, other]
-
Title: Graph-Based Bidirectional Transformer Decision Threshold Adjustment Algorithm for Class-Imbalanced Molecular DataSubjects: Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
Data sets with imbalanced class sizes, often where one class size is much smaller than that of others, occur extremely often in various applications, including those with biological foundations, such as drug discovery and disease diagnosis. Thus, it is extremely important to be able to identify data elements of classes of various sizes, as a failure to detect can result in heavy costs. However, many data classification algorithms do not perform well on imbalanced data sets as they often fail to detect elements belonging to underrepresented classes. In this paper, we propose the BTDT-MBO algorithm, incorporating Merriman-Bence-Osher (MBO) techniques and a bidirectional transformer, as well as distance correlation and decision threshold adjustments, for data classification problems on highly imbalanced molecular data sets, where the sizes of the classes vary greatly. The proposed method not only integrates adjustments in the classification threshold for the MBO algorithm in order to help deal with the class imbalance, but also uses a bidirectional transformer model based on an attention mechanism for self-supervised learning. Additionally, the method implements distance correlation as a weight function for the similarity graph-based framework on which the adjusted MBO algorithm operates. The proposed model is validated using six molecular data sets, and we also provide a thorough comparison to other competing algorithms. The computational experiments show that the proposed method performs better than competing techniques even when the class imbalance ratio is very high.
- [1274] arXiv:2406.06840 (replaced) [pdf, html, other]
-
Title: Silent Signals, Loud Impact: LLMs for Word-Sense Disambiguation of Coded Dog WhistlesComments: ACL 2024Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
A dog whistle is a form of coded communication that carries a secondary meaning to specific audiences and is often weaponized for racial and socioeconomic discrimination. Dog whistling historically originated from United States politics, but in recent years has taken root in social media as a means of evading hate speech detection systems and maintaining plausible deniability. In this paper, we present an approach for word-sense disambiguation of dog whistles from standard speech using Large Language Models (LLMs), and leverage this technique to create a dataset of 16,550 high-confidence coded examples of dog whistles used in formal and informal communication. Silent Signals is the largest dataset of disambiguated dog whistle usage, created for applications in hate speech detection, neology, and political science.
The dataset can be found at this https URL. - [1275] arXiv:2406.06858 (replaced) [pdf, html, other]
-
Title: FLUX: Fast Software-based Communication Overlap On GPUs Through Kernel FusionLi-Wen Chang, Wenlei Bao, Qi Hou, Chengquan Jiang, Ningxin Zheng, Yinmin Zhong, Xuanrun Zhang, Zuquan Song, Ziheng Jiang, Haibin Lin, Xin Jin, Xin LiuSubjects: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
Large deep learning models have demonstrated strong ability to solve many tasks across a wide range of applications. Those large models typically require training and inference to be distributed. Tensor parallelism is a common technique partitioning computation of an operation or layer across devices to overcome the memory capacity limitation of a single processor, and/or to accelerate computation to meet a certain latency requirement. However, this kind of parallelism introduces additional communication that might contribute a significant portion of overall runtime. Thus limits scalability of this technique within a group of devices with high speed interconnects, such as GPUs with NVLinks in a node. This paper proposes a novel method, Flux, to significantly hide communication latencies with dependent computations for GPUs. Flux over-decomposes communication and computation operations into much finer-grained operations and further fuses them into a larger kernel to effectively hide communication without compromising kernel efficiency. Flux can potentially overlap up to 96% of communication given a fused kernel. Overall, it can achieve up to 1.24x speedups for training over Megatron-LM on a cluster of 128 GPUs with various GPU generations and interconnects, and up to 1.66x and 1.30x speedups for prefill and decoding inference over vLLM on a cluster with 8 GPUs with various GPU generations and interconnects.
- [1276] arXiv:2406.06874 (replaced) [pdf, html, other]
-
Title: Joint Demonstration and Preference Learning Improves Policy Alignment with Human FeedbackSubjects: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Robotics (cs.RO)
Aligning human preference and value is an important requirement for building contemporary foundation models and embodied AI. However, popular approaches such as reinforcement learning with human feedback (RLHF) break down the task into successive stages, such as supervised fine-tuning (SFT), reward modeling (RM), and reinforcement learning (RL), each performing one specific learning task. Such a sequential approach results in serious issues such as significant under-utilization of data and distribution mismatch between the learned reward model and generated policy, which eventually lead to poor alignment performance. We develop a single stage approach named Alignment with Integrated Human Feedback (AIHF), capable of integrating both human preference and demonstration to train reward models and the policy. The proposed approach admits a suite of efficient algorithms, which can easily reduce to, and leverage, popular alignment algorithms such as RLHF and Directly Policy Optimization (DPO), and only requires minor changes to the existing alignment pipelines. We demonstrate the efficiency of the proposed solutions with extensive experiments involving alignment problems in LLMs and robotic control problems in MuJoCo. We observe that the proposed solutions outperform the existing alignment algorithms such as RLHF and DPO by large margins, especially when the amount of high-quality preference data is relatively limited.
- [1277] arXiv:2406.06978 (replaced) [pdf, html, other]
-
Title: Hydra-MDP: End-to-end Multimodal Planning with Multi-target Hydra-DistillationZhenxin Li, Kailin Li, Shihao Wang, Shiyi Lan, Zhiding Yu, Yishen Ji, Zhiqi Li, Ziyue Zhu, Jan Kautz, Zuxuan Wu, Yu-Gang Jiang, Jose M. AlvarezComments: The 1st place solution of End-to-end Driving at Scale at the CVPR 2024 Autonomous Grand ChallengeSubjects: Computer Vision and Pattern Recognition (cs.CV)
We propose Hydra-MDP, a novel paradigm employing multiple teachers in a teacher-student model. This approach uses knowledge distillation from both human and rule-based teachers to train the student model, which features a multi-head decoder to learn diverse trajectory candidates tailored to various evaluation metrics. With the knowledge of rule-based teachers, Hydra-MDP learns how the environment influences the planning in an end-to-end manner instead of resorting to non-differentiable post-processing. This method achieves the $1^{st}$ place in the Navsim challenge, demonstrating significant improvements in generalization across diverse driving environments and conditions. Code will be available at \url{this https URL}.
- [1278] arXiv:2406.07424 (replaced) [pdf, html, other]
-
Title: MINERS: Multilingual Language Models as Semantic RetrieversComments: PreprintSubjects: Computation and Language (cs.CL)
Words have been represented in a high-dimensional vector space that encodes their semantic similarities, enabling downstream applications such as retrieving synonyms, antonyms, and relevant contexts. However, despite recent advances in multilingual language models (LMs), the effectiveness of these models' representations in semantic retrieval contexts has not been comprehensively explored. To fill this gap, this paper introduces the MINERS, a benchmark designed to evaluate the ability of multilingual LMs in semantic retrieval tasks, including bitext mining and classification via retrieval-augmented contexts. We create a comprehensive framework to assess the robustness of LMs in retrieving samples across over 200 diverse languages, including extremely low-resource languages in challenging cross-lingual and code-switching settings. Our results demonstrate that by solely retrieving semantically similar embeddings yields performance competitive with state-of-the-art approaches, without requiring any fine-tuning.
- [1279] arXiv:2406.07598 (replaced) [pdf, html, other]
-
Title: Equivariance via Minimal Frame Averaging for More Symmetries and EfficiencySubjects: Machine Learning (cs.LG)
We consider achieving equivariance in machine learning systems via frame averaging. Current frame averaging methods involve a costly sum over large frames or rely on sampling-based approaches that only yield approximate equivariance. Here, we propose Minimal Frame Averaging (MFA), a mathematical framework for constructing provably minimal frames that are exactly equivariant. The general foundations of MFA also allow us to extend frame averaging to more groups than previously considered, including the Lorentz group for describing symmetries in space-time, and the unitary group for complex-valued domains. Results demonstrate the efficiency and effectiveness of encoding symmetries via MFA across a diverse range of tasks, including $n$-body simulation, top tagging in collider physics, and relaxed energy prediction. Our code is available at this https URL.
- [1280] arXiv:2406.07730 (replaced) [pdf, other]
-
Title: "It answers questions that I didn't know I had": Ph.D. Students' Evaluation of an Information Sharing Knowledge GraphSubjects: Human-Computer Interaction (cs.HC); Digital Libraries (cs.DL); Information Retrieval (cs.IR)
Interdisciplinary PhD programs can be challenging as the vital information needed by students may not be readily available, it is scattered across university's websites, while tacit knowledge can be obtained only by interacting with people. Hence, there is a need to develop a knowledge management model to create, query, and maintain a knowledge repository for interdisciplinary students. We propose a knowledge graph containing information on critical categories and their relationships, extracted from multiple sources, essential for interdisciplinary PhD students. This study evaluates the usability of a participatory designed knowledge graph intended to facilitate information exchange and decision-making. The usability findings demonstrate that interaction with this knowledge graph benefits PhD students by notably reducing uncertainty and academic stress, particularly among newcomers. Knowledge graph supported them in decision making, especially when choosing collaborators in an interdisciplinary setting. Key helpful features are related to exploring student faculty networks, milestones tracking, rapid access to aggregated data, and insights into crowdsourced fellow students' activities. The knowledge graph provides a solution to meet the personalized needs of doctoral researchers and has the potential to improve the information discovery and decision-making process substantially. It also includes the tacit knowledge exchange support missing from most current approaches, which is critical for this population and establishing interdisciplinary collaborations. This approach can be applied to other interdisciplinary programs and domains globally.
- [1281] arXiv:2406.08416 (replaced) [pdf, html, other]
-
Title: TokSing: Singing Voice Synthesis based on Discrete TokensComments: Accepted by Interspeech 2024Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
Recent advancements in speech synthesis witness significant benefits by leveraging discrete tokens extracted from self-supervised learning (SSL) models. Discrete tokens offer higher storage efficiency and greater operability in intermediate representations compared to traditional continuous Mel spectrograms. However, when it comes to singing voice synthesis(SVS), achieving higher levels of melody expression poses a great challenge for utilizing discrete tokens. In this paper, we introduce TokSing, a discrete-based SVS system equipped with a token formulator that offers flexible token blendings. We observe a melody degradation during discretization, prompting us to integrate a melody signal with the discrete token and incorporate a specially-designed melody enhancement strategy in the musical encoder. Extensive experiments demonstrate that our TokSing achieves better performance against the Mel spectrogram baselines while offering advantages in intermediate representation space cost and convergence speed.
- [1282] arXiv:2406.08689 (replaced) [pdf, html, other]
-
Title: Security of AI AgentsSubjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
The study and development of AI agents have been boosted by large language models. AI agents can function as intelligent assistants and complete tasks on behalf of their users with access to tools and the ability to execute commands in their environments, Through studying and experiencing the workflow of typical AI agents, we have raised several concerns regarding their security. These potential vulnerabilities are not addressed by the frameworks used to build the agents, nor by research aimed at improving the agents. In this paper, we identify and describe these vulnerabilities in detail from a system security perspective, emphasizing their causes and severe effects. Furthermore, we introduce defense mechanisms corresponding to each vulnerability with meticulous design and experiments to evaluate their viability. Altogether, this paper contextualizes the security issues in the current development of AI agents and delineates methods to make AI agents safer and more reliable.
- [1283] arXiv:2406.08905 (replaced) [pdf, html, other]
-
Title: SingOMD: Singing Oriented Multi-resolution Discrete Representation Construction from Speech ModelsComments: Accepted by Interspeech 2024Subjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
Discrete representation has shown advantages in speech generation tasks, wherein discrete tokens are derived by discretizing hidden features from self-supervised learning (SSL) pre-trained models. However, the direct application of speech SSL models to singing generation encounters domain gaps between speech and singing. Furthermore, singing generation necessitates a more refined representation than typical speech. To address these challenges, we introduce SingOMD, a novel method to extract singing-oriented multi-resolution discrete representations from speech SSL models. Specifically, we first adapt the features from speech SSL through a resynthesis task and incorporate multi-resolution modules based on resampling to better serve singing generation. These adapted multi-resolution features are then discretized via clustering. Extensive experiments demonstrate the robustness, efficiency, and effectiveness of these representations in singing vocoders and singing voice synthesis.
- [1284] arXiv:2406.08931 (replaced) [pdf, html, other]
-
Title: Exploring Multilingual Unseen Speaker Emotion Recognition: Leveraging Co-Attention Cues in Multitask LearningComments: 5 pages, Accepted to INTERSPEECH 2024. The first two authors contributed equallySubjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Advent of modern deep learning techniques has given rise to advancements in the field of Speech Emotion Recognition (SER). However, most systems prevalent in the field fail to generalize to speakers not seen during training. This study focuses on handling challenges of multilingual SER, specifically on unseen speakers. We introduce CAMuLeNet, a novel architecture leveraging co-attention based fusion and multitask learning to address this problem. Additionally, we benchmark pretrained encoders of Whisper, HuBERT, Wav2Vec2.0, and WavLM using 10-fold leave-speaker-out cross-validation on five existing multilingual benchmark datasets: IEMOCAP, RAVDESS, CREMA-D, EmoDB and CaFE and, release a novel dataset for SER on the Hindi language (BhavVani). CAMuLeNet shows an average improvement of approximately 8% over all benchmarks on unseen speakers determined by our cross-validation strategy.
- [1285] arXiv:2406.09009 (replaced) [pdf, html, other]
-
Title: Fredformer: Frequency Debiased Transformer for Time Series ForecastingComments: This paper has been accepted by SIGKDD2024Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
The Transformer model has shown leading performance in time series forecasting. Nevertheless, in some complex scenarios, it tends to learn low-frequency features in the data and overlook high-frequency features, showing a frequency bias. This bias prevents the model from accurately capturing important high-frequency data features. In this paper, we undertook empirical analyses to understand this bias and discovered that frequency bias results from the model disproportionately focusing on frequency features with higher energy. Based on our analysis, we formulate this bias and propose Fredformer, a Transformer-based framework designed to mitigate frequency bias by learning features equally across different frequency bands. This approach prevents the model from overlooking lower amplitude features important for accurate forecasting. Extensive experiments show the effectiveness of our proposed approach, which can outperform other baselines in different real-world time-series datasets. Furthermore, we introduce a lightweight variant of the Fredformer with an attention matrix approximation, which achieves comparable performance but with much fewer parameters and lower computation costs. The code is available at: this https URL
- [1286] arXiv:2406.09039 (replaced) [pdf, html, other]
-
Title: Language-Driven Closed-Loop Grasping with Model-Predictive Trajectory ReplanningComments: 9 pages, 6 figuresSubjects: Robotics (cs.RO)
Combining a vision module inside a closed-loop control system for a \emph{seamless movement} of a robot in a manipulation task is challenging due to the inconsistent update rates between utilized modules. This task is even more difficult in a dynamic environment, e.g., objects are moving. This paper presents a \emph{modular} zero-shot framework for language-driven manipulation of (dynamic) objects through a closed-loop control system with real-time trajectory replanning and an online 6D object pose localization. We segment an object within $\SI{0.5}{\second}$ by leveraging a vision language model via language commands. Then, guided by natural language commands, a closed-loop system, including a unified pose estimation and tracking and online trajectory planning, is utilized to continuously track this object and compute the optimal trajectory in real-time. Our proposed zero-shot framework provides a smooth trajectory that avoids jerky movements and ensures the robot can grasp a non-stationary object. Experiment results exhibit the real-time capability of the proposed zero-shot modular framework for the trajectory optimization module to accurately and efficiently grasp moving objects, i.e., up to \SI{30}{\hertz} update rates for the online 6D pose localization module and \SI{10}{\hertz} update rates for the receding-horizon trajectory optimization. These advantages highlight the modular framework's potential applications in robotics and human-robot interaction; see the video in this https URL.
- [1287] arXiv:2406.09201 (replaced) [pdf, html, other]
-
Title: Enhanced Object Detection: A Study on Vast Vocabulary Object Detection Track for V3Det Challenge 2024Journal-ref: Second Place in CVPR 2024 Vast Vocabulary Visual Detection ChallengeSubjects: Computer Vision and Pattern Recognition (cs.CV)
In this technical report, we present our findings from the research conducted on the Vast Vocabulary Visual Detection (V3Det) dataset for Supervised Vast Vocabulary Visual Detection task. How to deal with complex categories and detection boxes has become a difficulty in this track. The original supervised detector is not suitable for this task. We have designed a series of improvements, including adjustments to the network structure, changes to the loss function, and design of training strategies. Our model has shown improvement over the baseline and achieved excellent rankings on the Leaderboard for both the Vast Vocabulary Object Detection (Supervised) track and the Open Vocabulary Object Detection (OVD) track of the V3Det Challenge 2024.
- [1288] arXiv:2406.09272 (replaced) [pdf, html, other]
-
Title: Action2Sound: Ambient-Aware Generation of Action Sounds from Egocentric VideosComments: Project page: this https URLSubjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Generating realistic audio for human interactions is important for many applications, such as creating sound effects for films or virtual reality games. Existing approaches implicitly assume total correspondence between the video and audio during training, yet many sounds happen off-screen and have weak to no correspondence with the visuals -- resulting in uncontrolled ambient sounds or hallucinations at test time. We propose a novel ambient-aware audio generation model, AV-LDM. We devise a novel audio-conditioning mechanism to learn to disentangle foreground action sounds from the ambient background sounds in in-the-wild training videos. Given a novel silent video, our model uses retrieval-augmented generation to create audio that matches the visual content both semantically and temporally. We train and evaluate our model on two in-the-wild egocentric video datasets Ego4D and EPIC-KITCHENS. Our model outperforms an array of existing methods, allows controllable generation of the ambient sound, and even shows promise for generalizing to computer graphics game clips. Overall, our work is the first to focus video-to-audio generation faithfully on the observed visual content despite training from uncurated clips with natural background sounds.
- [1289] arXiv:2406.09363 (replaced) [pdf, html, other]
-
Title: ElicitationGPT: Text Elicitation Mechanisms via Language ModelsSubjects: Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG)
Scoring rules evaluate probabilistic forecasts of an unknown state against the realized state and are a fundamental building block in the incentivized elicitation of information and the training of machine learning models. This paper develops mechanisms for scoring elicited text against ground truth text using domain-knowledge-free queries to a large language model (specifically ChatGPT) and empirically evaluates their alignment with human preferences. The empirical evaluation is conducted on peer reviews from a peer-grading dataset and in comparison to manual instructor scores for the peer reviews.
- [1290] arXiv:2406.09798 (replaced) [pdf, html, other]
-
Title: Sim-to-Real Transfer via 3D Feature Fields for Vision-and-Language NavigationComments: Submitted to CoRL 2024. The code is available at this https URLSubjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Vision-and-language navigation (VLN) enables the agent to navigate to a remote location in 3D environments following the natural language instruction. In this field, the agent is usually trained and evaluated in the navigation simulators, lacking effective approaches for sim-to-real transfer. The VLN agents with only a monocular camera exhibit extremely limited performance, while the mainstream VLN models trained with panoramic observation, perform better but are difficult to deploy on most monocular robots. For this case, we propose a sim-to-real transfer approach to endow the monocular robots with panoramic traversability perception and panoramic semantic understanding, thus smoothly transferring the high-performance panoramic VLN models to the common monocular robots. In this work, the semantic traversable map is proposed to predict agent-centric navigable waypoints, and the novel view representations of these navigable waypoints are predicted through the 3D feature fields. These methods broaden the limited field of view of the monocular robots and significantly improve navigation performance in the real world. Our VLN system outperforms previous SOTA monocular VLN methods in R2R-CE and RxR-CE benchmarks within the simulation environments and is also validated in real-world environments, providing a practical and high-performance solution for real-world VLN.
- [1291] arXiv:2406.09870 (replaced) [pdf, other]
-
Title: IGL-Bench: Establishing the Comprehensive Benchmark for Imbalanced Graph LearningJiawen Qin, Haonan Yuan, Qingyun Sun, Lyujin Xu, Jiaqi Yuan, Pengfeng Huang, Zhaonan Wang, Xingcheng Fu, Hao Peng, Jianxin Li, Philip S. YuComments: The Thirty-eight Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Preprint, under review)Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Deep graph learning has gained grand popularity over the past years due to its versatility and success in representing graph data across a wide range of domains. However, the pervasive issue of imbalanced graph data distributions, where certain parts exhibit disproportionally abundant data while others remain sparse, undermines the efficacy of conventional graph learning algorithms, leading to biased outcomes. To address this challenge, Imbalanced Graph Learning (IGL) has garnered substantial attention, enabling more balanced data distributions and better task performance. Despite the proliferation of IGL algorithms, the absence of consistent experimental protocols and fair performance comparisons pose a significant barrier to comprehending advancements in this field. To bridge this gap, we introduce IGL-Bench, a foundational comprehensive benchmark for imbalanced graph learning, embarking on 16 diverse graph datasets and 24 distinct IGL algorithms with uniform data processing and splitting strategies. Specifically, IGL-Bench systematically investigates state-of-the-art IGL algorithms in terms of effectiveness, robustness, and efficiency on node-level and graph-level tasks, with the scope of class-imbalance and topology-imbalance. Extensive experiments demonstrate the potential benefits of IGL algorithms on various imbalanced conditions, offering insights and opportunities in the IGL field. Further, we have developed an open-sourced and unified package to facilitate reproducible evaluation and inspire further innovative research, which is available at this https URL.
- [1292] arXiv:2406.09899 (replaced) [pdf, html, other]
-
Title: Learning Solution-Aware Transformers for Efficiently Solving Quadratic Assignment ProblemComments: Accepted by ICML 2024Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Recently various optimization problems, such as Mixed Integer Linear Programming Problems (MILPs), have undergone comprehensive investigation, leveraging the capabilities of machine learning. This work focuses on learning-based solutions for efficiently solving the Quadratic Assignment Problem (QAPs), which stands as a formidable challenge in combinatorial optimization. While many instances of simpler problems admit fully polynomial-time approximate solution (FPTAS), QAP is shown to be strongly NP-hard. Even finding a FPTAS for QAP is difficult, in the sense that the existence of a FPTAS implies $P = NP$. Current research on QAPs suffer from limited scale and computational inefficiency. To attack the aforementioned issues, we here propose the first solution of its kind for QAP in the learn-to-improve category. This work encodes facility and location nodes separately, instead of forming computationally intensive association graphs prevalent in current approaches. This design choice enables scalability to larger problem sizes. Furthermore, a \textbf{S}olution \textbf{AW}are \textbf{T}ransformer (SAWT) architecture integrates the incumbent solution matrix with the attention score to effectively capture higher-order information of the QAPs. Our model's effectiveness is validated through extensive experiments on self-generated QAP instances of varying sizes and the QAPLIB benchmark.
- [1293] arXiv:2406.10453 (replaced) [pdf, html, other]
-
Title: Fast Geometric Learning of MIMO Signal Detection over Grassmannian ManifoldsSubjects: Systems and Control (eess.SY)
Domain or statistical distribution shifts are a key staple of the wireless communication channel, because of the dynamics of the environment. Deep learning (DL) models for detecting multiple-input multiple-output (MIMO) signals in dynamic communication require large training samples (in the order of hundreds of thousands to millions) and online retraining to adapt to domain shift. Some dynamic networks, such as vehicular networks, cannot tolerate the waiting time associated with gathering a large number of training samples or online fine-tuning which incurs significant end-to-end delay. In this paper, a novel classification technique based on the concept of geodesic flow kernel (GFK) is proposed for MIMO signal detection. In particular, received MIMO signals are first represented as points on Grassmannian manifolds by formulating basis of subspaces spanned by the rows vectors of the received signal. Then, the domain shift is modeled using a geodesic flow kernel integrating the subspaces that lie on the geodesic to characterize changes in geometric and statistical properties of the received signals. The kernel derives low-dimensional representations of the received signals over the Grassman manifolds that are invariant to domain shift and is used in a geometric support vector machine (G-SVM) algorithm for MIMO signal detection in an unsupervised manner. Simulation results reveal that the proposed method achieves promising performance against the existing baselines like OAMPnet and MMNet with only 1,200 training samples and without online retraining.
- [1294] arXiv:2406.10584 (replaced) [pdf, html, other]
-
Title: Concentrate Attention: Towards Domain-Generalizable Prompt Optimization for Language ModelsComments: PreprintSubjects: Computation and Language (cs.CL)
Recent advances in prompt optimization have notably enhanced the performance of pre-trained language models (PLMs) on downstream tasks. However, the potential of optimized prompts on domain generalization has been under-explored. To explore the nature of prompt generalization on unknown domains, we conduct pilot experiments and find that (i) Prompts gaining more attention weight from PLMs' deep layers are more generalizable and (ii) Prompts with more stable attention distributions in PLMs' deep layers are more generalizable. Thus, we offer a fresh objective towards domain-generalizable prompts optimization named "Concentration", which represents the "lookback" attention from the current decoding token to the prompt tokens, to increase the attention strength on prompts and reduce the fluctuation of attention distribution. We adapt this new objective to popular soft prompt and hard prompt optimization methods, respectively. Extensive experiments demonstrate that our idea improves comparison prompt optimization methods by 1.42% for soft prompt generalization and 2.16% for hard prompt generalization in accuracy on the multi-source domain generalization setting, while maintaining satisfying in-domain performance. The promising results validate the effectiveness of our proposed prompt optimization objective and provide key insights into domain-generalizable prompts.
- [1295] arXiv:2406.10594 (replaced) [pdf, html, other]
-
Title: BlockPruner: Fine-grained Pruning for Large Language ModelsSubjects: Computation and Language (cs.CL)
With the rapid growth in the size and complexity of large language models (LLMs), the costs associated with their training and inference have escalated significantly. Research indicates that certain layers in LLMs harbor substantial redundancy, and pruning these layers has minimal impact on the overall performance. While various layer pruning methods have been developed based on this insight, they generally overlook the finer-grained redundancies within the layers themselves. In this paper, we delve deeper into the architecture of LLMs and demonstrate that finer-grained pruning can be achieved by targeting redundancies in multi-head attention (MHA) and multi-layer perceptron (MLP) blocks. We propose a novel, training-free structured pruning approach called BlockPruner. Unlike existing layer pruning methods, BlockPruner segments each Transformer layer into MHA and MLP blocks. It then assesses the importance of these blocks using perplexity measures and applies a heuristic search for iterative pruning. We applied BlockPruner to LLMs of various sizes and architectures and validated its performance across a wide range of downstream tasks. Experimental results show that BlockPruner achieves more granular and effective pruning compared to state-of-the-art baselines.
- [1296] arXiv:2406.10671 (replaced) [pdf, other]
-
Title: Augmenting Biomedical Named Entity Recognition with General-domain ResourcesYu Yin, Hyunjae Kim, Xiao Xiao, Chih Hsuan Wei, Jaewoo Kang, Zhiyong Lu, Hua Xu, Meng Fang, Qingyu ChenComments: We make data, codes, and models publicly available via this https URLSubjects: Computation and Language (cs.CL)
Training a neural network-based biomedical named entity recognition (BioNER) model usually requires extensive and costly human annotations. While several studies have employed multi-task learning with multiple BioNER datasets to reduce human effort, this approach does not consistently yield performance improvements and may introduce label ambiguity in different biomedical corpora. We aim to tackle those challenges through transfer learning from easily accessible resources with fewer concept overlaps with biomedical datasets. In this paper, we proposed GERBERA, a simple-yet-effective method that utilized a general-domain NER dataset for training. Specifically, we performed multi-task learning to train a pre-trained biomedical language model with both the target BioNER dataset and the general-domain dataset. Subsequently, we fine-tuned the models specifically for the BioNER dataset. We systematically evaluated GERBERA on five datasets of eight entity types, collectively consisting of 81,410 instances. Despite using fewer biomedical resources, our models demonstrated superior performance compared to baseline models trained with multiple additional BioNER datasets. Specifically, our models consistently outperformed the baselines in six out of eight entity types, achieving an average improvement of 0.9% over the best baseline performance across eight biomedical entity types sourced from five different corpora. Our method was especially effective in amplifying performance on BioNER datasets characterized by limited data, with a 4.7% improvement in F1 scores on the JNLPBA-RNA dataset.
- [1297] arXiv:2406.10690 (replaced) [pdf, html, other]
-
Title: Bridging the Gap in Drug Safety Data Analysis: Large Language Models for SQL Query GenerationComments: 17 pages, 5 figuresSubjects: Artificial Intelligence (cs.AI); Databases (cs.DB)
Pharmacovigilance (PV) is essential for drug safety, primarily focusing on adverse event monitoring. Traditionally, accessing safety data required database expertise, limiting broader use. This paper introduces a novel application of Large Language Models (LLMs) to democratize database access for non-technical users. Utilizing OpenAI's GPT-4, we developed a chatbot that generates structured query language (SQL) queries from natural language, bridging the gap between domain knowledge and technical requirements. The proposed application aims for more inclusive and efficient data access, enhancing decision making in drug safety. By providing LLMs with plain language summaries of expert knowledge, our approach significantly improves query accuracy over methods relying solely on database schemas. The application of LLMs in this context not only optimizes PV data analysis, ensuring timely and precise drug safety reporting -- a crucial component in adverse drug reaction monitoring -- but also promotes safer pharmacological practices and informed decision making across various data intensive fields.
- [1298] arXiv:2406.10719 (replaced) [pdf, html, other]
-
Title: Trading Devil: Robust backdoor attack via Stochastic investment models and Bayesian approachComments: (Last update) Stochastic investment models and a Bayesian approach to better modeling of uncertainty : adversarial machine learning or Stochastic market. arXiv admin note: substantial text overlap with arXiv:2402.05967Subjects: Cryptography and Security (cs.CR); Machine Learning (cs.LG); Computational Finance (q-fin.CP); Statistical Finance (q-fin.ST); Machine Learning (stat.ML)
With the growing use of voice-activated systems and speech recognition technologies, the danger of backdoor attacks on audio data has grown significantly. This research looks at a specific type of attack, known as a Stochastic investment-based backdoor attack (MarketBack), in which adversaries strategically manipulate the stylistic properties of audio to fool speech recognition systems. The security and integrity of machine learning models are seriously threatened by backdoor attacks, in order to maintain the reliability of audio applications and systems, the identification of such attacks becomes crucial in the context of audio data. Experimental results demonstrated that MarketBack is feasible to achieve an average attack success rate close to 100% in seven victim models when poisoning less than 1% of the training data.
- [1299] arXiv:2406.10768 (replaced) [pdf, html, other]
-
Title: Rideshare Transparency: Translating Gig Worker Insights on AI Platform Design to PolicySubjects: Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
Rideshare platforms exert significant control over workers through algorithmic systems that can result in financial, emotional, and physical harm. What steps can platforms, designers, and practitioners take to mitigate these negative impacts and meet worker needs? In this paper, through a novel mixed methods study combining a LLM-based analysis of over 1 million comments posted to online platform worker communities with semi-structured interviews of workers, we thickly characterize transparency-related harms, mitigation strategies, and worker needs while validating and contextualizing our findings within the broader worker community. Our findings expose a transparency gap between existing platform designs and the information drivers need, particularly concerning promotions, fares, routes, and task allocation. Our analysis suggests that rideshare workers need key pieces of information, which we refer to as indicators, to make informed work decisions. These indicators include details about rides, driver statistics, algorithmic implementation details, and platform policy information. We argue that instead of relying on platforms to include such information in their designs, new regulations that require platforms to publish public transparency reports may be a more effective solution to improve worker well-being. We offer recommendations for implementing such a policy.
- [1300] arXiv:2406.10825 (replaced) [pdf, html, other]
-
Title: Griesmer and Optimal Linear Codes from the Affine Solomon-Stiffler ConstructionComments: 22 pagesSubjects: Information Theory (cs.IT)
In their fundamental paper published in 1965, G. Solomon and J. J. Stiffler invented infinite families of codes meeting the Griesmer bound. These codes are then called Solomon-Stiffler codes and have motivated various constructions of codes meeting or close the Griesmer bound.
In this paper, we give a geometric construction of infinite families of affine and modified affine Solomon-Stiffler codes. Projective Solomon-Stiffler codes are special cases of our modified affine Solomon-Stiffler codes. Several infinite families of $q$-ary Griesmer, optimal, almost optimal two-weight, three-weight, four-weight and five-weight linear codes are constructed as special cases of our construction. Weight distributions of these Griesmer, optimal or almost optimal codes are determined. Many optimal linear codes documented in Grassl's list are re-constructed as (modified) affine Solomon-Stiffler codes.
Several infinite families of optimal or Griesmer codes were constructed in two published papers in IEEE Transactions on Information Theory 2017 and 2019, via Gray images of codes over finite rings. Parameters and weight distributions of these Griesmer or optimal codes and very special case codes in our construction are the same. We also indicate that more general distance-optimal binary linear codes than that constructed in a recent paper of IEEE Transactions on Information Theory can be obtained directly from codimension one subcodes in binary Solomon-Stiffler codes. - [1301] arXiv:2406.10911 (replaced) [pdf, html, other]
-
Title: SingMOS: An extensive Open-Source Singing Voice Dataset for MOS PredictionSubjects: Sound (cs.SD); Audio and Speech Processing (eess.AS)
In speech generation tasks, human subjective ratings, usually referred to as the opinion score, are considered the "gold standard" for speech quality evaluation, with the mean opinion score (MOS) serving as the primary evaluation metric. Due to the high cost of human annotation, several MOS prediction systems have emerged in the speech domain, demonstrating good performance. These MOS prediction models are trained using annotations from previous speech-related challenges. However, compared to the speech domain, the singing domain faces data scarcity and stricter copyright protections, leading to a lack of high-quality MOS-annotated datasets for singing. To address this, we propose SingMOS, a high-quality and diverse MOS dataset for singing, covering a range of Chinese and Japanese datasets. These synthesized vocals are generated using state-of-the-art models in singing synthesis, conversion, or resynthesis tasks and are rated by professional annotators alongside real vocals. Data analysis demonstrates the diversity and reliability of our dataset. Additionally, we conduct further exploration on SingMOS, providing insights for singing MOS prediction and guidance for the continued expansion of SingMOS.
- [1302] arXiv:2406.10937 (replaced) [pdf, html, other]
-
Title: Understanding Understanding: A Pragmatic Framework Motivated by Large Language ModelsSubjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Motivated by the rapid ascent of Large Language Models (LLMs) and debates about the extent to which they possess human-level qualities, we propose a framework for testing whether any agent (be it a machine or a human) understands a subject matter. In Turing-test fashion, the framework is based solely on the agent's performance, and specifically on how well it answers questions. Elements of the framework include circumscribing the set of questions (the "scope of understanding"), requiring general competence ("passing grade"), avoiding "ridiculous answers", but still allowing wrong and "I don't know" answers to some questions. Reaching certainty about these conditions requires exhaustive testing of the questions which is impossible for nontrivial scopes, but we show how high confidence can be achieved via random sampling and the application of probabilistic confidence bounds. We also show that accompanying answers with explanations can improve the sample complexity required to achieve acceptable bounds, because an explanation of an answer implies the ability to answer many similar questions. According to our framework, current LLMs cannot be said to understand nontrivial domains, but as the framework provides a practical recipe for testing understanding, it thus also constitutes a tool for building AI agents that do understand.
- [1303] arXiv:2406.11087 (replaced) [pdf, html, other]
-
Title: MemDPT: Differential Privacy for Memory Efficient Language ModelsYanming Liu, Xinyue Peng, Jiannan Cao, Yuwei Zhang, Chen Ma, Songhang Deng, Mengchen Fu, Xuhong Zhang, Sheng Cheng, Xun Wang, Jianwei Yin, Tianyu DuComments: 12 pages first versionSubjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Large language models have consistently demonstrated remarkable performance across a wide spectrum of applications. Nonetheless, the deployment of these models can inadvertently expose user privacy to potential risks. The substantial memory demands of these models during training represent a significant resource consumption challenge. The sheer size of these models imposes a considerable burden on memory resources, which is a matter of significant concern in practice. In this paper, we present an innovative training framework MemDPT that not only reduces the memory cost of large language models but also places a strong emphasis on safeguarding user data privacy. MemDPT provides edge network and reverse network designs to accommodate various differential privacy memory-efficient fine-tuning schemes. Our approach not only achieves $2 \sim 3 \times$ memory optimization but also provides robust privacy protection, ensuring that user data remains secure and confidential. Extensive experiments have demonstrated that MemDPT can effectively provide differential privacy efficient fine-tuning across various task scenarios.
- [1304] arXiv:2406.11097 (replaced) [pdf, html, other]
-
Title: InstructCMP: Length Control in Sentence Compression through Instruction-based Large Language ModelsComments: 8 pages, 3 figures, accepted to ACL 2024 Findings (Long Paper)Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Extractive summarization can produce faithful summaries but often requires additional constraints such as a desired summary length. Traditional sentence compression models do not typically consider the constraints because of their restricted model abilities, which require model modifications for coping with them. To bridge this gap, we propose Instruction-based Compression (InstructCMP), an approach to the sentence compression task that can consider the length constraint through instructions by leveraging the zero-shot task-solving abilities of Large Language Models (LLMs). For this purpose, we created new evaluation datasets by transforming traditional sentence compression datasets into an instruction format. By using the datasets, we first reveal that the current LLMs still face challenges in accurately controlling the length for a compressed text. To address this issue, we propose an approach named "length priming," that incorporates additional length information into the instructions without external resources. While the length priming effectively works in a zero-shot setting, a training dataset with the instructions would further improve the ability of length control. Thus, we additionally created a training dataset in an instruction format to fine-tune the model on it. Experimental results and analysis show that applying the length priming significantly improves performances of InstructCMP in both zero-shot and fine-tuning settings without the need of any model modifications.
- [1305] arXiv:2406.11131 (replaced) [pdf, html, other]
-
Title: Are Large Language Models a Good Replacement of Taxonomies?Comments: Accepted by VLDB 2024Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Databases (cs.DB)
Large language models (LLMs) demonstrate an impressive ability to internalize knowledge and answer natural language questions. Although previous studies validate that LLMs perform well on general knowledge while presenting poor performance on long-tail nuanced knowledge, the community is still doubtful about whether the traditional knowledge graphs should be replaced by LLMs. In this paper, we ask if the schema of knowledge graph (i.e., taxonomy) is made obsolete by LLMs. Intuitively, LLMs should perform well on common taxonomies and at taxonomy levels that are common to people. Unfortunately, there lacks a comprehensive benchmark that evaluates the LLMs over a wide range of taxonomies from common to specialized domains and at levels from root to leaf so that we can draw a confident conclusion. To narrow the research gap, we constructed a novel taxonomy hierarchical structure discovery benchmark named TaxoGlimpse to evaluate the performance of LLMs over taxonomies. TaxoGlimpse covers ten representative taxonomies from common to specialized domains with in-depth experiments of different levels of entities in this taxonomy from root to leaf. Our comprehensive experiments of eighteen state-of-the-art LLMs under three prompting settings validate that LLMs can still not well capture the knowledge of specialized taxonomies and leaf-level entities.
- [1306] arXiv:2406.11147 (replaced) [pdf, html, other]
-
Title: Vul-RAG: Enhancing LLM-based Vulnerability Detection via Knowledge-level RAGXueying Du, Geng Zheng, Kaixin Wang, Jiayi Feng, Wentai Deng, Mingwei Liu, Bihuan Chen, Xin Peng, Tao Ma, Yiling LouSubjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Vulnerability detection is essential for software quality assurance. In recent years, deep learning models (especially large language models) have shown promise in vulnerability detection. In this work, we propose a novel LLM-based vulnerability detection technique Vul-RAG, which leverages knowledge-level retrieval-augmented generation (RAG) framework to detect vulnerability for the given code in three phases. First, Vul-RAG constructs a vulnerability knowledge base by extracting multi-dimension knowledge via LLMs from existing CVE instances; second, for a given code snippet, Vul-RAG} retrieves the relevant vulnerability knowledge from the constructed knowledge base based on functional semantics; third, Vul-RAG leverages LLMs to check the vulnerability of the given code snippet by reasoning the presence of vulnerability causes and fixing solutions of the retrieved vulnerability knowledge. Our evaluation of Vul-RAG on our constructed benchmark PairVul shows that Vul-RAG substantially outperforms all baselines by 12.96\%/110\% relative improvement in accuracy/pairwise-accuracy. In addition, our user study shows that the vulnerability knowledge generated by Vul-RAG can serve as high-quality explanations which can improve the manual detection accuracy from 0.60 to 0.77.
- [1307] arXiv:2406.11151 (replaced) [pdf, html, other]
-
Title: Recent and Upcoming Developments in Randomized Numerical Linear Algebra for Machine LearningSubjects: Machine Learning (cs.LG); Numerical Analysis (math.NA); Machine Learning (stat.ML)
Large matrices arise in many machine learning and data analysis applications, including as representations of datasets, graphs, model weights, and first and second-order derivatives. Randomized Numerical Linear Algebra (RandNLA) is an area which uses randomness to develop improved algorithms for ubiquitous matrix problems. The area has reached a certain level of maturity; but recent hardware trends, efforts to incorporate RandNLA algorithms into core numerical libraries, and advances in machine learning, statistics, and random matrix theory, have lead to new theoretical and practical challenges. This article provides a self-contained overview of RandNLA, in light of these developments.
- [1308] arXiv:2406.11171 (replaced) [pdf, html, other]
-
Title: SUGARCREPE++ Dataset: Vision-Language Model Sensitivity to Semantic and Lexical AlterationsComments: Added the dataset link to the abstractSubjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
Despite their remarkable successes, state-of-the-art large language models (LLMs), including vision-and-language models (VLMs) and unimodal language models (ULMs), fail to understand precise semantics. For example, semantically equivalent sentences expressed using different lexical compositions elicit diverging representations. The degree of this divergence and its impact on encoded semantics is not very well understood. In this paper, we introduce the SUGARCREPE++ dataset to analyze the sensitivity of VLMs and ULMs to lexical and semantic alterations. Each sample in SUGARCREPE++ dataset consists of an image and a corresponding triplet of captions: a pair of semantically equivalent but lexically different positive captions and one hard negative caption. This poses a 3-way semantic (in)equivalence problem to the language models. We comprehensively evaluate VLMs and ULMs that differ in architecture, pre-training objectives and datasets to benchmark the performance of SUGARCREPE++ dataset. Experimental results highlight the difficulties of VLMs in distinguishing between lexical and semantic variations, particularly in object attributes and spatial relations. Although VLMs with larger pre-training datasets, model sizes, and multiple pre-training objectives achieve better performance on SUGARCREPE++, there is a significant opportunity for improvement. We show that all the models which achieve better performance on compositionality datasets need not perform equally well on SUGARCREPE++, signifying that compositionality alone may not be sufficient for understanding semantic and lexical alterations. Given the importance of the property that the SUGARCREPE++ dataset targets, it serves as a new challenge to the vision-and-language community.
- [1309] arXiv:2406.11173 (replaced) [pdf, html, other]
-
Title: BSRBF-KAN: A combination of B-splines and Radial Basic Functions in Kolmogorov-Arnold NetworksComments: 8 pages, 1 figure, 3 tablesSubjects: Computation and Language (cs.CL)
In this paper, we introduce BSRBF-KAN, a Kolmogorov Arnold Network (KAN) that combines Bsplines and radial basis functions (RBFs) to fit input vectors in data training. We perform experiments with BSRBF-KAN, MLP, and other popular KANs, including EfficientKAN, FastKAN, FasterKAN, and GottliebKAN over the MNIST and Fashion-MNIST datasets. BSRBF-KAN shows stability in 5 training sessions with a competitive average accuracy of 97.55% on MNIST and 89.33% on FashionMNIST and obtains convergence better than other networks. We expect BSRBF-KAN to open many combinations of mathematical functions to design KANs. Our repo is publicly available at: this https URL.
- [1310] arXiv:2406.11213 (replaced) [pdf, html, other]
-
Title: A Survey of AIOps for Failure Management in the Era of Large Language ModelsComments: 34 pagesSubjects: Software Engineering (cs.SE)
As software systems grow increasingly intricate, Artificial Intelligence for IT Operations (AIOps) methods have been widely used in software system failure management to ensure the high availability and reliability of large-scale distributed software systems. However, these methods still face several challenges, such as lack of cross-platform generality and cross-task flexibility. Fortunately, recent advancements in large language models (LLMs) can significantly address these challenges, and many approaches have already been proposed to explore this field. However, there is currently no comprehensive survey that discusses the differences between LLM-based AIOps and traditional AIOps methods. Therefore, this paper presents a comprehensive survey of AIOps technology for failure management in the LLM era. It includes a detailed definition of AIOps tasks for failure management, the data sources for AIOps, and the LLM-based approaches adopted for AIOps. Additionally, this survey explores the AIOps subtasks, the specific LLM-based approaches suitable for different AIOps subtasks, and the challenges and future directions of the domain, aiming to further its development and application.
- [1311] arXiv:2406.11354 (replaced) [pdf, html, other]
-
Title: Preserving Knowledge in Large Language Model with Model-Agnostic Self-DecompressionSubjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Humans can retain old knowledge while learning new information, but Large Language Models (LLMs) often suffer from catastrophic forgetting when post-pretrained or supervised fine-tuned (SFT) on domain-specific data. Moreover, for Multimodal Large Language Models (MLLMs) which are composed of the LLM base and visual projector (e.g. LLaVA), a significant decline in performance on language benchmarks was observed compared to their single-modality counterparts. To address these challenges, we introduce a novel model-agnostic self-decompression method, Tree Generation (TG), that decompresses knowledge within LLMs into the training corpus. This paper focuses on TG-SFT, which can synthetically generate SFT data for the instruction tuning steps. By incorporating the dumped corpus during SFT for MLLMs, we significantly reduce the forgetting problem.
- [1312] arXiv:2406.11409 (replaced) [pdf, html, other]
-
Title: CodeGemma: Open Code Models Based on GemmaCodeGemma Team: Heri Zhao, Jeffrey Hui, Joshua Howland, Nam Nguyen, Siqi Zuo, Andrea Hu, Christopher A. Choquette-Choo, Jingyue Shen, Joe Kelley, Kshitij Bansal, Luke Vilnis, Mateo Wirth, Paul Michel, Peter Choy, Pratik Joshi, Ravin Kumar, Sarmad Hashmi, Shubham Agrawal, Zhitao Gong, Jane Fine, Tris Warkentin, Ale Jakse Hartman, Bin Ni, Kathy Korevec, Kelly Schaefer, Scott HuffmanComments: v1: 11 pages, 4 figures, 5 tables. v2: Update metadataSubjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
This paper introduces CodeGemma, a collection of specialized open code models built on top of Gemma, capable of a variety of code and natural language generation tasks. We release three model variants. CodeGemma 7B pretrained (PT) and instruction-tuned (IT) variants have remarkably resilient natural language understanding, excel in mathematical reasoning, and match code capabilities of other open models. CodeGemma 2B is a state-of-the-art code completion model designed for fast code infilling and open-ended generation in latency-sensitive settings.
- [1313] arXiv:2406.11414 (replaced) [pdf, html, other]
-
Title: Formally Certified Approximate Model CountingComments: The extended version, including the appendix, of the paper to be published in CAV24. The associated artifact is available at this https URLSubjects: Logic in Computer Science (cs.LO); Artificial Intelligence (cs.AI)
Approximate model counting is the task of approximating the number of solutions to an input Boolean formula. The state-of-the-art approximate model counter for formulas in conjunctive normal form (CNF), ApproxMC, provides a scalable means of obtaining model counts with probably approximately correct (PAC)-style guarantees. Nevertheless, the validity of ApproxMC's approximation relies on a careful theoretical analysis of its randomized algorithm and the correctness of its highly optimized implementation, especially the latter's stateful interactions with an incremental CNF satisfiability solver capable of natively handling parity (XOR) constraints.
We present the first certification framework for approximate model counting with formally verified guarantees on the quality of its output approximation. Our approach combines: (i) a static, once-off, formal proof of the algorithm's PAC guarantee in the Isabelle/HOL proof assistant; and (ii) dynamic, per-run, verification of ApproxMC's calls to an external CNF-XOR solver using proof certificates. We detail our general approach to establish a rigorous connection between these two parts of the verification, including our blueprint for turning the formalized, randomized algorithm into a verified proof checker, and our design of proof certificates for both ApproxMC and its internal CNF-XOR solving steps. Experimentally, we show that certificate generation adds little overhead to an approximate counter implementation, and that our certificate checker is able to fully certify $84.7\%$ of instances with generated certificates when given the same time and memory limits as the counter. - [1314] arXiv:2406.11565 (replaced) [pdf, html, other]
-
Title: Extrinsic Evaluation of Cultural Competence in Large Language ModelsComments: Under peer reviewSubjects: Computation and Language (cs.CL); Computers and Society (cs.CY)
Productive interactions between diverse users and language technologies require outputs from the latter to be culturally relevant and sensitive. Prior works have evaluated models' knowledge of cultural norms, values, and artifacts, without considering how this knowledge manifests in downstream applications. In this work, we focus on extrinsic evaluation of cultural competence in two text generation tasks, open-ended question answering and story generation. We quantitatively and qualitatively evaluate model outputs when an explicit cue of culture, specifically nationality, is perturbed in the prompts. Although we find that model outputs do vary when varying nationalities and feature culturally relevant words, we also find weak correlations between text similarity of outputs for different countries and the cultural values of these countries. Finally, we discuss important considerations in designing comprehensive evaluation of cultural competence in user-facing tasks.
- [1315] arXiv:2406.11583 (replaced) [pdf, other]
-
Title: Where there's a will there's a way: ChatGPT is used more for science in countries where it is prohibitedComments: Three figures, two tables, 20 pages, and a 17-page appendixSubjects: Digital Libraries (cs.DL); Computers and Society (cs.CY)
Regulating AI has emerged as a key societal challenge, but which methods of regulation are effective is unclear. Here, we measure the effectiveness of restricting AI services geographically using the case of ChatGPT and science. OpenAI prohibits access to ChatGPT from several countries including China and Russia. If the restrictions are effective, there should be minimal use of ChatGPT in prohibited countries. We measured use by developing a classifier based on prior work showing that early versions of ChatGPT overrepresented distinctive words like "delve." We trained the classifier on abstracts before and after ChatGPT "polishing" and validated it on held-out abstracts and those where authors self-declared to have used AI, where it substantially outperformed off-the-shelf LLM detectors GPTZero and ZeroGPT. Applying the classifier to preprints from Arxiv, BioRxiv, and MedRxiv reveals that ChatGPT was used in approximately 12.6% of preprints by August 2023 and use was 7.7% higher in countries without legal access. Crucially, these patterns appeared before the first major legal LLM became widely available in China, the largest restricted-country preprint producer. ChatGPT use was associated with higher views and downloads, but not citations or journal placement. Overall, restricting ChatGPT geographically has proven ineffective in science and possibly other domains, likely due to widespread workarounds.
- [1316] arXiv:2406.11661 (replaced) [pdf, html, other]
-
Title: Cultural Conditioning or Placebo? On the Effectiveness of Socio-Demographic PromptingSagnik Mukherjee, Muhammad Farid Adilazuarda, Sunayana Sitaram, Kalika Bali, Alham Fikri Aji, Monojit ChoudhurySubjects: Computation and Language (cs.CL)
Socio-demographic prompting is a commonly employed approach to study cultural biases in LLMs as well as for aligning models to certain cultures. In this paper, we systematically probe four LLMs (Llama 3, Mistral v0.2, GPT-3.5 Turbo and GPT-4) with prompts that are conditioned on culturally sensitive and non-sensitive cues, on datasets that are supposed to be culturally sensitive (EtiCor and CALI) or neutral (MMLU and ETHICS). We observe that all models except GPT-4 show significant variations in their responses on both kinds of datasets for both kinds of prompts, casting doubt on the robustness of the culturally-conditioned prompting as a method for eliciting cultural bias in models or as an alignment strategy. The work also calls rethinking the control experiment design to tease apart the cultural conditioning of responses from "placebo effect", i.e., random perturbations of model responses due to arbitrary tokens in the prompt.
- [1317] arXiv:2406.11697 (replaced) [pdf, html, other]
-
Title: GridSweep Simulation: Measuring Subsynchronous Impedance Spectra of Distribution FeederComments: 10 pages, 18 figuresSubjects: Systems and Control (eess.SY)
Peaks and troughs in the subsynchronous impedance spectrum of a distribution feeder may be a useful indication of oscillation risk, or more importantly lack of oscillation risk, if inverter-based resource (IBR) deployments are increased on that feeder. GridSweep is a new instrument for measuring the subsynchronous impedance spectra of distribution feeders. It combines an active probing device that modulates a 120-volt 1-kW load sinusoidally at a user-selected GPS-phase locked frequency from 1.0 to 40.0 Hz, and with a recorder that takes ultra-high-precision continuous point-on-wave (CPOW) 120-volt synchrowaveforms at 4 kHz. This paper presents a computer simulation of GridSweep's probing and measurement capability. We construct an electromagnetic transient (EMT) simulation of a single-phase distribution feeder equipped with multiple inverter-based resources (IBRs). We include a model of the GridSweep probing device, then demonstrate the model's capability to measure the subsynchronous apparent impedance spectrum of the feeder. Peaks in that spectrum align with the system's dominant oscillation modes caused by IBRs.
- [1318] arXiv:2406.11723 (replaced) [pdf, html, other]
-
Title: Control of Unknown Quadrotors from a Single ThrowComments: 7 pages, 5 figures, 2 tables. Submitted to the IROS 2024 conferenceSubjects: Robotics (cs.RO); Systems and Control (eess.SY)
This paper presents a method to recover quadrotor UAV from a throw, when no control parameters are known before the throw. We leverage the availability of high-frequency rotor speed feedback available in racing drone hardware and software to find control effectiveness values and fit a motor model using recursive least squares (RLS) estimation. Furthermore, we propose an excitation sequence that provides large actuation commands while guaranteeing to stay within gyroscope sensing limits. After 450ms of excitation, an INDI attitude controller uses the 52 fitted parameters to arrest rotational motion and recover an upright attitude. Finally, a NDI position controller drives the craft to a position setpoint. The proposed algorithm runs efficiently on microcontrollers found in common UAV flight controllers, and was shown to recover an agile quadrotor every time in 57 live experiments with as low as 3.5m throw height, demonstrating robustness against initial rotations and noise. We also demonstrate control of randomized quadrotors in simulated throws, where the parameter fitting RMS error is typically within 10% of the true value.
- [1319] arXiv:2406.11779 (replaced) [pdf, other]
-
Title: Provable Guarantees for Model Performance via Mechanistic InterpretabilityJason Gross, Rajashree Agrawal, Thomas Kwa, Euan Ong, Chun Hei Yip, Alex Gibson, Soufiane Noubir, Lawrence ChanSubjects: Machine Learning (cs.LG); Logic in Computer Science (cs.LO)
In this work, we propose using mechanistic interpretability -- techniques for reverse engineering model weights into human-interpretable algorithms -- to derive and compactly prove formal guarantees on model performance. We prototype this approach by formally proving lower bounds on the accuracy of 151 small transformers trained on a Max-of-$K$ task. We create 102 different computer-assisted proof strategies and assess their length and tightness of bound on each of our models. Using quantitative metrics, we find that shorter proofs seem to require and provide more mechanistic understanding. Moreover, we find that more faithful mechanistic understanding leads to tighter performance bounds. We confirm these connections by qualitatively examining a subset of our proofs. Finally, we identify compounding structureless noise as a key challenge for using mechanistic interpretability to generate compact proofs on model performance.
- [1320] arXiv:2406.11845 (replaced) [pdf, other]
-
Title: Decoding the Digital Fine Print: Navigating the potholes in Terms of service/ use of GenAI tools against the emerging need for Transparent and Trustworthy Tech FuturesSubjects: Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
The research investigates the crucial role of clear and intelligible terms of service in cultivating user trust and facilitating informed decision-making in the context of AI, in specific GenAI. It highlights the obstacles presented by complex legal terminology and detailed fine print, which impede genuine user consent and recourse, particularly during instances of algorithmic malfunctions, hazards, damages, or inequities, while stressing the necessity of employing machine-readable terms for effective service licensing. The increasing reliance on General Artificial Intelligence (GenAI) tools necessitates transparent, comprehensible, and standardized terms of use, which facilitate informed decision-making while fostering trust among stakeholders. Despite recent efforts promoting transparency via system and model cards, existing documentation frequently falls short of providing adequate disclosures, leaving users ill-equipped to evaluate potential risks and harms. To address this gap, this research examines key considerations necessary in terms of use or terms of service for Generative AI tools, drawing insights from multiple studies. Subsequently, this research evaluates whether the terms of use or terms of service of prominent Generative AI tools against the identified considerations. Findings indicate inconsistencies and variability in document quality, signaling a pressing demand for uniformity in disclosure practices. Consequently, this study advocates for robust, enforceable standards ensuring complete and intelligible disclosures prior to the release of GenAI tools, thereby empowering end-users to make well-informed choices and enhancing overall accountability in the field.
- [1321] arXiv:2406.11920 (replaced) [pdf, html, other]
-
Title: Job-SDF: A Multi-Granularity Dataset for Job Skill Demand Forecasting and BenchmarkingSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
In a rapidly evolving job market, skill demand forecasting is crucial as it enables policymakers and businesses to anticipate and adapt to changes, ensuring that workforce skills align with market needs, thereby enhancing productivity and competitiveness. Additionally, by identifying emerging skill requirements, it directs individuals towards relevant training and education opportunities, promoting continuous self-learning and development. However, the absence of comprehensive datasets presents a significant challenge, impeding research and the advancement of this field. To bridge this gap, we present Job-SDF, a dataset designed to train and benchmark job-skill demand forecasting models. Based on 10.35 million public job advertisements collected from major online recruitment platforms in China between 2021 and 2023, this dataset encompasses monthly recruitment demand for 2,324 types of skills across 521 companies. Our dataset uniquely enables evaluating skill demand forecasting models at various granularities, including occupation, company, and regional levels. We benchmark a range of models on this dataset, evaluating their performance in standard scenarios, in predictions focused on lower value ranges, and in the presence of structural breaks, providing new insights for further research. Our code and dataset are publicly accessible via the this https URL.
- [1322] arXiv:2406.11927 (replaced) [pdf, html, other]
-
Title: REPOEXEC: Evaluate Code Generation with a Repository-Level Executable BenchmarkSubjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
The ability of CodeLLMs to generate executable and functionally correct code at the repository-level scale remains largely unexplored. We introduce RepoExec, a novel benchmark for evaluating code generation at the repository-level scale. RepoExec focuses on three main aspects: executability, functional correctness through automated test case generation with high coverage rate, and carefully crafted cross-file contexts to accurately generate code. Our work explores a controlled scenario where developers specify necessary code dependencies, challenging the model to integrate these accurately. Experiments show that while pretrained LLMs outperform instruction-tuned models in correctness, the latter excel in utilizing provided dependencies and demonstrating debugging capabilities. We also introduce a new instruction-tuned dataset that focuses on code dependencies and demonstrate that CodeLLMs fine-tuned on our dataset have a better capability to leverage these dependencies effectively. RepoExec aims to provide a comprehensive evaluation of code functionality and alignment with developer intent, paving the way for more reliable and applicable CodeLLMs in real-world scenarios. The dataset and source code can be found at~\url{this https URL}.
- [1323] arXiv:2406.12033 (replaced) [pdf, html, other]
-
Title: Unveiling and Mitigating Bias in Mental Health Analysis with Large Language ModelsYuqing Wang, Yun Zhao, Sara Alessandra Keller, Anne de Hond, Marieke M. van Buchem, Malvika Pillai, Tina Hernandez-BoussardComments: In submission; Data and code are available at: this https URLSubjects: Computation and Language (cs.CL)
The advancement of large language models (LLMs) has demonstrated strong capabilities across various applications, including mental health analysis. However, existing studies have focused on predictive performance, leaving the critical issue of fairness underexplored, posing significant risks to vulnerable populations. Despite acknowledging potential biases, previous works have lacked thorough investigations into these biases and their impacts. To address this gap, we systematically evaluate biases across seven social factors (e.g., gender, age, religion) using ten LLMs with different prompting methods on eight diverse mental health datasets. Our results show that GPT-4 achieves the best overall balance in performance and fairness among LLMs, although it still lags behind domain-specific models like MentalRoBERTa in some cases. Additionally, our tailored fairness-aware prompts can effectively mitigate bias in mental health predictions, highlighting the great potential for fair analysis in this field.
- [1324] arXiv:2406.12058 (replaced) [pdf, html, other]
-
Title: WellDunn: On the Robustness and Explainability of Language Models and Large Language Models in Identifying Wellness DimensionsComments: 26 pages, including reference and appendix sections, 8 figures, and 16 tablesSubjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Language Models (LMs) are being proposed for mental health applications where the heightened risk of adverse outcomes means predictive performance may not be a sufficient litmus test of a model's utility in clinical practice. A model that can be trusted for practice should have a correspondence between explanation and clinical determination, yet no prior research has examined the attention fidelity of these models and their effect on ground truth explanations. We introduce an evaluation design that focuses on the robustness and explainability of LMs in identifying Wellness Dimensions (WD). We focus on two mental health and well-being datasets: (a) Multi-label Classification-based MultiWD, and (b) WellXplain for evaluating attention mechanism veracity against expert-labeled explanations. The labels are based on Halbert Dunn's theory of wellness, which gives grounding to our evaluation. We reveal four surprising results about LMs/LLMs: (1) Despite their human-like capabilities, GPT-3.5/4 lag behind RoBERTa, and MedAlpaca, a fine-tuned LLM fails to deliver any remarkable improvements in performance or explanations. (2) Re-examining LMs' predictions based on a confidence-oriented loss function reveals a significant performance drop. (3) Across all LMs/LLMs, the alignment between attention and explanations remains low, with LLMs scoring a dismal 0.0. (4) Most mental health-specific LMs/LLMs overlook domain-specific knowledge and undervalue explanations, causing these discrepancies. This study highlights the need for further research into their consistency and explanations in mental health and well-being.
- [1325] arXiv:2406.12066 (replaced) [pdf, html, other]
-
Title: Language Models are Surprisingly Fragile to Drug Names in Biomedical BenchmarksJack Gallifant, Shan Chen, Pedro Moreira, Nikolaj Munch, Mingye Gao, Jackson Pond, Leo Anthony Celi, Hugo Aerts, Thomas Hartvigsen, Danielle BittermanComments: submitted for review, total 15 pagesSubjects: Computation and Language (cs.CL)
Medical knowledge is context-dependent and requires consistent reasoning across various natural language expressions of semantically equivalent phrases. This is particularly crucial for drug names, where patients often use brand names like Advil or Tylenol instead of their generic equivalents. To study this, we create a new robustness dataset, RABBITS, to evaluate performance differences on medical benchmarks after swapping brand and generic drug names using physician expert annotations.
We assess both open-source and API-based LLMs on MedQA and MedMCQA, revealing a consistent performance drop ranging from 1-10\%. Furthermore, we identify a potential source of this fragility as the contamination of test data in widely used pre-training datasets. All code is accessible at this https URL, and a HuggingFace leaderboard is available at this https URL. - [1326] arXiv:2406.12072 (replaced) [pdf, html, other]
-
Title: DTGB: A Comprehensive Benchmark for Dynamic Text-Attributed GraphsComments: 28 pages, 13 figuresSubjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Dynamic text-attributed graphs (DyTAGs) are prevalent in various real-world scenarios, where each node and edge are associated with text descriptions, and both the graph structure and text descriptions evolve over time. Despite their broad applicability, there is a notable scarcity of benchmark datasets tailored to DyTAGs, which hinders the potential advancement in many research fields. To address this gap, we introduce Dynamic Text-attributed Graph Benchmark (DTGB), a collection of large-scale, time-evolving graphs from diverse domains, with nodes and edges enriched by dynamically changing text attributes and categories. To facilitate the use of DTGB, we design standardized evaluation procedures based on four real-world use cases: future link prediction, destination node retrieval, edge classification, and textual relation generation. These tasks require models to understand both dynamic graph structures and natural language, highlighting the unique challenges posed by DyTAGs. Moreover, we conduct extensive benchmark experiments on DTGB, evaluating 7 popular dynamic graph learning algorithms and their variants of adapting to text attributes with LLM embeddings, along with 6 powerful large language models (LLMs). Our results show the limitations of existing models in handling DyTAGs. Our analysis also demonstrates the utility of DTGB in investigating the incorporation of structural and textual dynamics. The proposed DTGB fosters research on DyTAGs and their broad applications. It offers a comprehensive benchmark for evaluating and advancing models to handle the interplay between dynamic graph structures and natural language. The dataset and source code are available at this https URL.
- [1327] arXiv:2406.12091 (replaced) [pdf, html, other]
-
Title: Is poisoning a real threat to LLM alignment? Maybe more so than you thinkJournal-ref: ICML 2024 Workshop MHFAIASubjects: Machine Learning (cs.LG); Computation and Language (cs.CL); Cryptography and Security (cs.CR)
Recent advancements in Reinforcement Learning with Human Feedback (RLHF) have significantly impacted the alignment of Large Language Models (LLMs). The sensitivity of reinforcement learning algorithms such as Proximal Policy Optimization (PPO) has led to new line work on Direct Policy Optimization (DPO), which treats RLHF in a supervised learning framework. The increased practical use of these RLHF methods warrants an analysis of their vulnerabilities. In this work, we investigate the vulnerabilities of DPO to poisoning attacks under different scenarios and compare the effectiveness of preference poisoning, a first of its kind. We comprehensively analyze DPO's vulnerabilities under different types of attacks, i.e., backdoor and non-backdoor attacks, and different poisoning methods across a wide array of language models, i.e., LLama 7B, Mistral 7B, and Gemma 7B. We find that unlike PPO-based methods, which, when it comes to backdoor attacks, require at least 4\% of the data to be poisoned to elicit harmful behavior, we exploit the true vulnerabilities of DPO more simply so we can poison the model with only as much as 0.5\% of the data. We further investigate the potential reasons behind the vulnerability and how well this vulnerability translates into backdoor vs non-backdoor attacks.
- [1328] arXiv:2406.12121 (replaced) [pdf, html, other]
-
Title: TutteNet: Injective 3D Deformations by Composition of 2D Mesh DeformationsSubjects: Computer Vision and Pattern Recognition (cs.CV)
This work proposes a novel representation of injective deformations of 3D space, which overcomes existing limitations of injective methods: inaccuracy, lack of robustness, and incompatibility with general learning and optimization frameworks. The core idea is to reduce the problem to a deep composition of multiple 2D mesh-based piecewise-linear maps. Namely, we build differentiable layers that produce mesh deformations through Tutte's embedding (guaranteed to be injective in 2D), and compose these layers over different planes to create complex 3D injective deformations of the 3D volume. We show our method provides the ability to efficiently and accurately optimize and learn complex deformations, outperforming other injective approaches. As a main application, we produce complex and artifact-free NeRF and SDF deformations.
- [1329] arXiv:2406.12168 (replaced) [pdf, html, other]
-
Title: BPO: Supercharging Online Preference Learning by Adhering to the Proximity of Behavior LLMComments: Wenda Xu and Jiachen Li contributed equallySubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Direct alignment from preferences (DAP) has emerged as a promising paradigm for aligning large language models (LLMs) to human desiderata from pre-collected, offline preference datasets. While recent studies indicate that existing offline DAP methods can directly benefit from online training samples, we highlight the need to develop specific online DAP algorithms to fully harness the power of online training. Specifically, we identify that the learned LLM should adhere to the proximity of the behavior LLM, which collects the training samples. To this end, we propose online Preference Optimization in proximity to the Behavior LLM (BPO), emphasizing the importance of constructing a proper trust region for LLM alignment.
We conduct extensive experiments to validate the effectiveness and applicability of our approach by integrating it with various DAP methods, resulting in significant performance improvements across a wide range of tasks when training with the same amount of preference data. Even when only introducing one additional data collection phase, our online BPO improves its offline DAP baseline from 72.0% to 80.2% on TL;DR and from 82.2% to 89.1% on Anthropic Helpfulness in terms of win rate against human reference text. - [1330] arXiv:2406.12173 (replaced) [pdf, html, other]
-
Title: MiSuRe is all you need to explain your image segmentationSubjects: Computer Vision and Pattern Recognition (cs.CV)
The last decade of computer vision has been dominated by Deep Learning architectures, thanks to their unparalleled success. Their performance, however, often comes at the cost of explainability owing to their highly non-linear nature. Consequently, a parallel field of eXplainable Artificial Intelligence (XAI) has developed with the aim of generating insights regarding the decision making process of deep learning models. An important problem in XAI is that of the generation of saliency maps. These are regions in an input image which contributed most towards the model's final decision. Most work in this regard, however, has been focused on image classification, and image segmentation - despite being a ubiquitous task - has not received the same attention. In the present work, we propose MiSuRe (Minimally Sufficient Region) as an algorithm to generate saliency maps for image segmentation. The goal of the saliency maps generated by MiSuRe is to get rid of irrelevant regions, and only highlight those regions in the input image which are crucial to the image segmentation decision. We perform our analysis on 3 datasets: Triangle (artificially constructed), COCO-2017 (natural images), and the Synapse multi-organ (medical images). Additionally, we identify a potential usecase of these post-hoc saliency maps in order to perform post-hoc reliability of the segmentation model.
- [1331] arXiv:2406.12196 (replaced) [pdf, html, other]
-
Title: CITADEL: Context Similarity Based Deep Learning Framework Bug FindingComments: 12 pages, 10 figuresSubjects: Software Engineering (cs.SE)
With deep learning (DL) technology becoming an integral part of the new intelligent software, tools of DL framework testing and bug-finding are in high demand. Existing DL framework testing tools have limited coverage on bug types. For example, they lack the capability of finding performance bugs, which are critical for DL model training and inference regarding performance, economics, and the environment. This problem is challenging due to the difficulty of getting test oracles of performance bugs. Moreover, existing tools are inefficient, generating hundreds of test cases with few trigger bugs. In this paper, we propose CITADEL, a method that accelerates the finding of bugs in terms of efficiency and effectiveness. We observe that many DL framework bugs are similar due to the similarity of operators and algorithms belonging to the same family (e.g., Conv2D and Conv3D). Orthogonal to existing bug-finding tools, CITADEL aims to find new bugs that are similar to reported ones that have known test oracles. It works by first collecting existing bug reports and identifying problematic APIs. CITADEL defines context similarity to measure the similarity of DL framework API pairs and automatically generates test cases with oracles for APIs that are similar to the problematic APIs in existing bug reports. CITADEL respectively covers 1,436 PyTorch and 5,380 TensorFlow APIs and effectively detects 79 and 80 API bugs, among which 58 and 68 are new, and 36 and 58 have been confirmed, many of which, e.g., the 11 performance bugs cannot be detected by existing tools. Moreover, a remarkable 35.40% of the test cases generated by CITADEL can trigger bugs, which significantly transcends the ratios of 0.74%, 1.23%, and 3.90% exhibited by the state-of-the-art methods, DocTer, DeepREL, and TitanFuzz.
- [1332] arXiv:2406.12214 (replaced) [pdf, html, other]
-
Title: Is Your HD Map Constructor Reliable under Sensor Corruptions?Xiaoshuai Hao, Mengchuan Wei, Yifan Yang, Haimei Zhao, Hui Zhang, Yi Zhou, Qiang Wang, Weiming Li, Lingdong Kong, Jing ZhangComments: project url: this https URLSubjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Driving systems often rely on high-definition (HD) maps for precise environmental information, which is crucial for planning and navigation. While current HD map constructors perform well under ideal conditions, their resilience to real-world challenges, \eg, adverse weather and sensor failures, is not well understood, raising safety concerns. This work introduces MapBench, the first comprehensive benchmark designed to evaluate the robustness of HD map construction methods against various sensor corruptions. Our benchmark encompasses a total of 29 types of corruptions that occur from cameras and LiDAR sensors. Extensive evaluations across 31 HD map constructors reveal significant performance degradation of existing methods under adverse weather conditions and sensor failures, underscoring critical safety concerns. We identify effective strategies for enhancing robustness, including innovative approaches that leverage multi-modal fusion, advanced data augmentation, and architectural techniques. These insights provide a pathway for developing more reliable HD map construction methods, which are essential for the advancement of autonomous driving technology. The benchmark toolkit and affiliated code and model checkpoints have been made publicly accessible.
- [1333] arXiv:2406.12246 (replaced) [pdf, html, other]
-
Title: TroL: Traversal of Layers for Large Language and Vision ModelsComments: Code is available in this https URLSubjects: Machine Learning (cs.LG); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Large language and vision models (LLVMs) have been driven by the generalization power of large language models (LLMs) and the advent of visual instruction tuning. Along with scaling them up directly, these models enable LLVMs to showcase powerful vision language (VL) performances by covering diverse tasks via natural language instructions. However, existing open-source LLVMs that perform comparably to closed-source LLVMs such as GPT-4V are often considered too large (e.g., 26B, 34B, and 110B parameters), having a larger number of layers. These large models demand costly, high-end resources for both training and inference. To address this issue, we present a new efficient LLVM family with 1.8B, 3.8B, and 7B LLM model sizes, Traversal of Layers (TroL), which enables the reuse of layers in a token-wise manner. This layer traversing technique simulates the effect of looking back and retracing the answering stream while increasing the number of forward propagation layers without physically adding more layers. We demonstrate that TroL employs a simple layer traversing approach yet efficiently outperforms the open-source LLVMs with larger model sizes and rivals the performances of the closed-source LLVMs with substantial sizes.
- [1334] arXiv:2406.12272 (replaced) [pdf, html, other]
-
Title: Slot State Space ModelsSubjects: Artificial Intelligence (cs.AI)
Recent State Space Models (SSMs) such as S4, S5, and Mamba have shown remarkable computational benefits in long-range temporal dependency modeling. However, in many sequence modeling problems, the underlying process is inherently modular and it is of interest to have inductive biases that mimic this modular structure. In this paper, we introduce SlotSSMs, a novel framework for incorporating independent mechanisms into SSMs to preserve or encourage separation of information. Unlike conventional SSMs that maintain a monolithic state vector, SlotSSMs maintains the state as a collection of multiple vectors called slots. Crucially, the state transitions are performed independently per slot with sparse interactions across slots implemented via the bottleneck of self-attention. In experiments, we evaluate our model in object-centric video understanding, 3D visual reasoning, and video prediction tasks, which involve modeling multiple objects and their long-range temporal dependencies. We find that our proposed design offers substantial performance gains over existing sequence modeling methods.
- [1335] arXiv:2406.12356 (replaced) [pdf, html, other]
-
Title: A Gradient Accumulation Method for Dense Retriever under Memory ConstraintSubjects: Information Retrieval (cs.IR)
InfoNCE loss is commonly used to train dense retriever in information retrieval tasks. It is well known that a large batch is essential to stable and effective training with InfoNCE loss, which requires significant hardware resources. Due to the dependency of large batch, dense retriever has bottleneck of application and research. Recently, memory reduction methods have been broadly adopted to resolve the hardware bottleneck by decomposing forward and backward or using a memory bank. However, current methods still suffer from slow and unstable training. To address these issues, we propose Contrastive Accumulation (ContAccum), a stable and efficient memory reduction method for dense retriever trains that uses a dual memory bank structure to leverage previously generated query and passage representations. Experiments on widely used five information retrieval datasets indicate that ContAccum can surpass not only existing memory reduction methods but also high-resource scenario. Moreover, theoretical analysis and experimental results confirm that ContAccum provides more stable dual-encoder training than current memory bank utilization methods.
- [1336] arXiv:2406.12376 (replaced) [pdf, html, other]
-
Title: DCS Chain: A Flexible Private Blockchain SystemSubjects: Cryptography and Security (cs.CR)
Blockchain technology has seen tremendous development over the past few years. Despite the emergence of numerous blockchain systems, they all suffer from various limitations, which can all be attributed to the fundamental issue posed by the DCS trilemma. In light of this, this work introduces a novel private blockchain system named DCS Chain. The core idea is to quantify the DCS metrics and dynamically adjust the blockchain's performance across these three dimensions, to achieve theoretically optimal system performance. Overall, our system provides a comprehensive suite of blockchain essentials, including DCS quantification, consensus protocol adjustment, and communication network simulation.
- [1337] arXiv:2406.12433 (replaced) [pdf, html, other]
-
Title: LLM-enhanced Reranking in Recommender SystemsJingtong Gao, Bo Chen, Xiangyu Zhao, Weiwen Liu, Xiangyang Li, Yichao Wang, Zijian Zhang, Wanyu Wang, Yuyang Ye, Shanru Lin, Huifeng Guo, Ruiming TangSubjects: Information Retrieval (cs.IR)
Reranking is a critical component in recommender systems, playing an essential role in refining the output of recommendation algorithms. Traditional reranking models have focused predominantly on accuracy, but modern applications demand consideration of additional criteria such as diversity and fairness. Existing reranking approaches often fail to harmonize these diverse criteria effectively at the model level. Moreover, these models frequently encounter challenges with scalability and personalization due to their complexity and the varying significance of different reranking criteria in diverse scenarios. In response, we introduce a comprehensive reranking framework enhanced by LLM, designed to seamlessly integrate various reranking criteria while maintaining scalability and facilitating personalized recommendations. This framework employs a fully connected graph structure, allowing the LLM to simultaneously consider multiple aspects such as accuracy, diversity, and fairness through a coherent Chain-of-Thought (CoT) process. A customizable input mechanism is also integrated, enabling the tuning of the language model's focus to meet specific reranking needs. We validate our approach using three popular public datasets, where our framework demonstrates superior performance over existing state-of-the-art reranking models in balancing multiple criteria. The code for this implementation is publicly available.
- [1338] arXiv:2406.12443 (replaced) [pdf, html, other]
-
Title: Robustness Testing of Multi-Modal Models in Varied Home Environments for Assistive RobotsComments: Geriatronics Summit 2024, July 09 - 10, Garmisch-Partenkirchen Congress CenterSubjects: Robotics (cs.RO)
The development of assistive robotic agents to support household tasks is advancing, yet the underlying models often operate in virtual settings that do not reflect real-world complexity. For assistive care robots to be effective in diverse environments, their models must be robust and integrate multiple modalities. Consider a caretaker needing assistance in a dimly lit room or navigating around a newly installed glass door. Models relying solely on visual input might fail in low light, while those using depth information could avoid the door. This demonstrates the necessity for models that can process various sensory inputs. Our ongoing study evaluates state-of-the-art robotic models in the AI2Thor virtual environment. We introduce disturbances, such as dimmed lighting and mirrored walls, to assess their impact on modalities like movement or vision, and object recognition. Our goal is to gather input from the Geriatronics community to understand and model the challenges faced by practitioners.
- [1339] arXiv:2406.12534 (replaced) [pdf, html, other]
-
Title: Unified Active Retrieval for Retrieval Augmented GenerationQinyuan Cheng, Xiaonan Li, Shimin Li, Qin Zhu, Zhangyue Yin, Yunfan Shao, Linyang Li, Tianxiang Sun, Hang Yan, Xipeng QiuSubjects: Computation and Language (cs.CL)
In Retrieval-Augmented Generation (RAG), retrieval is not always helpful and applying it to every instruction is sub-optimal. Therefore, determining whether to retrieve is crucial for RAG, which is usually referred to as Active Retrieval. However, existing active retrieval methods face two challenges: 1. They usually rely on a single criterion, which struggles with handling various types of instructions. 2. They depend on specialized and highly differentiated procedures, and thus combining them makes the RAG system more complicated and leads to higher response latency. To address these challenges, we propose Unified Active Retrieval (UAR). UAR contains four orthogonal criteria and casts them into plug-and-play classification tasks, which achieves multifaceted retrieval timing judgements with negligible extra inference cost. We further introduce the Unified Active Retrieval Criteria (UAR-Criteria), designed to process diverse active retrieval scenarios through a standardized procedure. Experiments on four representative types of user instructions show that UAR significantly outperforms existing work on the retrieval timing judgement and the performance of downstream tasks, which shows the effectiveness of UAR and its helpfulness to downstream tasks.
- [1340] arXiv:2406.12572 (replaced) [pdf, html, other]
-
Title: Mathador-LM: A Dynamic Benchmark for Mathematical Reasoning on Large Language ModelsSubjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
We introduce Mathador-LM, a new benchmark for evaluating the mathematical reasoning on large language models (LLMs), combining ruleset interpretation, planning, and problem-solving. This benchmark is inspired by the Mathador game, where the objective is to reach a target number using basic arithmetic operations on a given set of base numbers, following a simple set of rules. We show that, across leading LLMs, we obtain stable average performance while generating benchmark instances dynamically, following a target difficulty level. Thus, our benchmark alleviates concerns about test-set leakage into training data, an issue that often undermines popular benchmarks. Additionally, we conduct a comprehensive evaluation of both open and closed-source state-of-the-art LLMs on Mathador-LM. Our findings reveal that contemporary models struggle with Mathador-LM, scoring significantly lower than average 3rd graders. This stands in stark contrast to their strong performance on popular mathematical reasoning benchmarks.
- [1341] arXiv:2406.12574 (replaced) [pdf, other]
-
Title: A Behavioral Theory for Distributed Systems with Weak RecoveryComments: 50 pagesSubjects: Logic in Computer Science (cs.LO)
Distributed systems can be subject to various kinds of partial failures, therefore building fault-tolerance or failure mitigation mechanisms for distributed systems remains an important domain of research. In this paper, we present a calculus to formally model distributed systems subject to crash failures with recovery. The recovery model considered in the paper is weak, in the sense that it makes no assumption on the exact state in which a failed node resumes its execution, only its identity has to be distinguishable from past incarnations of itself. Our calculus is inspired in part by the Erlang programming language and in part by the distributed $\pi$-calculus with nodes and link failures (D$\pi$F) introduced by Francalanza and Hennessy. In order to reason about distributed systems with failures and recovery we develop a behavioral theory for our calculus, in the form of a contextual equivalence, and of a fully abstract coinductive characterization of this equivalence by means of a labelled transition system semantics and its associated weak bisimilarity. This result is valuable for it provides a compositional proof technique for proving or disproving contextual equivalence between systems.
- [1342] arXiv:2406.12649 (replaced) [pdf, html, other]
-
Title: Probabilistic Conceptual Explainers: Trustworthy Conceptual Explanations for Vision Foundation ModelsComments: Accepted at ICML 2024Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
Vision transformers (ViTs) have emerged as a significant area of focus, particularly for their capacity to be jointly trained with large language models and to serve as robust vision foundation models. Yet, the development of trustworthy explanation methods for ViTs has lagged, particularly in the context of post-hoc interpretations of ViT predictions. Existing sub-image selection approaches, such as feature-attribution and conceptual models, fall short in this regard. This paper proposes five desiderata for explaining ViTs -- faithfulness, stability, sparsity, multi-level structure, and parsimony -- and demonstrates the inadequacy of current methods in meeting these criteria comprehensively. We introduce a variational Bayesian explanation framework, dubbed ProbAbilistic Concept Explainers (PACE), which models the distributions of patch embeddings to provide trustworthy post-hoc conceptual explanations. Our qualitative analysis reveals the distributions of patch-level concepts, elucidating the effectiveness of ViTs by modeling the joint distribution of patch embeddings and ViT's predictions. Moreover, these patch-level explanations bridge the gap between image-level and dataset-level explanations, thus completing the multi-level structure of PACE. Through extensive experiments on both synthetic and real-world datasets, we demonstrate that PACE surpasses state-of-the-art methods in terms of the defined desiderata.
- [1343] arXiv:2406.12746 (replaced) [pdf, html, other]
-
Title: Rationale-based Ensemble of Multiple QA Strategies for Zero-shot Knowledge-based VQASubjects: Computation and Language (cs.CL)
Knowledge-based Visual Qustion-answering (K-VQA) necessitates the use of background knowledge beyond what is depicted in the image. Current zero-shot K-VQA methods usually translate an image to a single type of textual decision context and use a text-based model to answer the question based on it, which conflicts with the fact that K-VQA questions often require the combination of multiple question-answering strategies. In light of this, we propose Rationale-based Ensemble of Answer Context Tactics (REACT) to achieve a dynamic ensemble of multiple question-answering tactics, comprising Answer Candidate Generation (ACG) and Rationale-based Strategy Fusion (RSF). In ACG, we generate three distinctive decision contexts to provide different strategies for each question, resulting in the generation of three answer candidates. RSF generates automatic and mechanistic rationales from decision contexts for each candidate, allowing the model to select the correct answer from all candidates. We conduct comprehensive experiments on the OK-VQA and A-OKVQA datasets, and our method significantly outperforms state-of-the-art LLM-based baselines on all datasets.
- [1344] arXiv:2406.12770 (replaced) [pdf, html, other]
-
Title: Informatics & dairy industry coalition: AI trends and present challengesSubjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Artificial Intelligence (AI) can potentially transform the industry, enhancing the production process and minimizing manual, repetitive tasks. Accordingly, the synergy between high-performance computing and powerful mathematical models enables the application of sophisticated data analysis procedures like Machine Learning. However, challenges exist regarding effective, efficient, and flexible processing to generate valuable knowledge. Consequently, this work comprehensively describes industrial challenges where AI can be exploited, focusing on the dairy industry. The conclusions presented can help researchers apply novel approaches for cattle monitoring and farmers by proposing advanced technological solutions to their needs.
- [1345] arXiv:2406.12805 (replaced) [pdf, html, other]
-
Title: AITTI: Learning Adaptive Inclusive Token for Text-to-Image GenerationSubjects: Computer Vision and Pattern Recognition (cs.CV)
Despite the high-quality results of text-to-image generation, stereotypical biases have been spotted in their generated contents, compromising the fairness of generative models. In this work, we propose to learn adaptive inclusive tokens to shift the attribute distribution of the final generative outputs. Unlike existing de-biasing approaches, our method requires neither explicit attribute specification nor prior knowledge of the bias distribution. Specifically, the core of our method is a lightweight adaptive mapping network, which can customize the inclusive tokens for the concepts to be de-biased, making the tokens generalizable to unseen concepts regardless of their original bias distributions. This is achieved by tuning the adaptive mapping network with a handful of balanced and inclusive samples using an anchor loss. Experimental results demonstrate that our method outperforms previous bias mitigation methods without attribute specification while preserving the alignment between generative results and text descriptions. Moreover, our method achieves comparable performance to models that require specific attributes or editing directions for generation. Extensive experiments showcase the effectiveness of our adaptive inclusive tokens in mitigating stereotypical bias in text-to-image generation. The code will be available at this https URL.
- [1346] arXiv:2406.12808 (replaced) [pdf, html, other]
-
Title: Graph Neural Networks in Histopathology: Emerging Trends and Future DirectionsSubjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Tissues and Organs (q-bio.TO)
Histopathological analysis of Whole Slide Images (WSIs) has seen a surge in the utilization of deep learning methods, particularly Convolutional Neural Networks (CNNs). However, CNNs often fall short in capturing the intricate spatial dependencies inherent in WSIs. Graph Neural Networks (GNNs) present a promising alternative, adept at directly modeling pairwise interactions and effectively discerning the topological tissue and cellular structures within WSIs. Recognizing the pressing need for deep learning techniques that harness the topological structure of WSIs, the application of GNNs in histopathology has experienced rapid growth. In this comprehensive review, we survey GNNs in histopathology, discuss their applications, and explore emerging trends that pave the way for future advancements in the field. We begin by elucidating the fundamentals of GNNs and their potential applications in histopathology. Leveraging quantitative literature analysis, we identify four emerging trends: Hierarchical GNNs, Adaptive Graph Structure Learning, Multimodal GNNs, and Higher-order GNNs. Through an in-depth exploration of these trends, we offer insights into the evolving landscape of GNNs in histopathological analysis. Based on our findings, we propose future directions to propel the field forward. Our analysis serves to guide researchers and practitioners towards innovative approaches and methodologies, fostering advancements in histopathological analysis through the lens of graph neural networks.
- [1347] arXiv:1906.05757 (replaced) [pdf, html, other]
-
Title: The rank of sparse random matricesComments: This article supersedes arXiv:1810.07390Subjects: Combinatorics (math.CO); Discrete Mathematics (cs.DM); Probability (math.PR)
We determine the rank of a random matrix over an arbitrary field with prescribed numbers of non-zero entries in each row and column. As an application we obtain a formula for the rate of low-density parity check codes. This formula vindicates a conjecture of Lelarge (2013). The proofs are based on coupling arguments and a novel random perturbation, applicable to any matrix, that diminishes the number of short linear relations.
- [1348] arXiv:2206.02909 (replaced) [pdf, html, other]
-
Title: Self-supervised Learning for Human Activity Recognition Using 700,000 Person-days of Wearable DataHang Yuan, Shing Chan, Andrew P. Creagh, Catherine Tong, Aidan Acquah, David A. Clifton, Aiden DohertyJournal-ref: npj Digit. Med. 7, 91 (2024)Subjects: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Advances in deep learning for human activity recognition have been relatively limited due to the lack of large labelled datasets. In this study, we leverage self-supervised learning techniques on the UK-Biobank activity tracker dataset--the largest of its kind to date--containing more than 700,000 person-days of unlabelled wearable sensor data. Our resulting activity recognition model consistently outperformed strong baselines across seven benchmark datasets, with an F1 relative improvement of 2.5%-100% (median 18.4%), the largest improvements occurring in the smaller datasets. In contrast to previous studies, our results generalise across external datasets, devices, and environments. Our open-source model will help researchers and developers to build customisable and generalisable activity classifiers with high performance.
- [1349] arXiv:2208.11698 (replaced) [pdf, html, other]
-
Title: Rate-Distortion Theory for Mixed StatesComments: 25 pages, 2 figures. v3: updated with additional observations and notesSubjects: Quantum Physics (quant-ph); Information Theory (cs.IT)
This paper is concerned with quantum data compression of asymptotically many independent and identically distributed copies of ensembles of mixed quantum states. The encoder has access to a side information system. The figure of merit is per-copy or local error criterion. Rate-distortion theory studies the trade-off between the compression rate and the per-copy error. The optimal trade-off can be characterized by the rate-distortion function, which is the best rate given a certain distortion. In this paper, we derive the rate-distortion function of mixed-state compression. The rate-distortion functions in the entanglement-assisted and unassisted scenarios are in terms of a single-letter mutual information quantity and the regularized entanglement of purification, respectively. For the general setting where the consumption of both communication and entanglement are considered, we present the full qubit-entanglement rate region. Our compression scheme covers both blind and visible compression models (and other models in between) depending on the structure of the side information system.
- [1350] arXiv:2208.13296 (replaced) [pdf, html, other]
-
Title: Polynomial time guarantees for sampling based posterior inference in high-dimensional generalised linear modelsComments: Revised and updated versionSubjects: Statistics Theory (math.ST); Analysis of PDEs (math.AP); Numerical Analysis (math.NA); Probability (math.PR); Computation (stat.CO)
The problem of computing posterior functionals in general high-dimensional statistical models with possibly non-log-concave likelihood functions is considered. Based on the proof strategy of [49], but using only local likelihood conditions and without relying on M-estimation theory, nonasymptotic statistical and computational guarantees are provided for a gradient based MCMC algorithm. Given a suitable initialiser, these guarantees scale polynomially in key algorithmic quantities. The abstract results are applied to several concrete statistical models, including density estimation, nonparametric regression with generalised linear models and a canonical statistical non-linear inverse problem from PDEs.
- [1351] arXiv:2209.11920 (replaced) [pdf, html, other]
-
Title: Tradeoffs between convergence rate and noise amplification for momentum-based accelerated optimization algorithmsComments: 23 pages; 7 figuresSubjects: Optimization and Control (math.OC); Machine Learning (cs.LG); Systems and Control (eess.SY); Dynamical Systems (math.DS)
We study momentum-based first-order optimization algorithms in which the iterations utilize information from the two previous steps and are subject to an additive white noise. This setup uses noise to account for uncertainty in either gradient evaluation or iteration updates, and it includes Polyak's heavy-ball and Nesterov's accelerated methods as special cases. For strongly convex quadratic problems, we use the steady-state variance of the error in the optimization variable to quantify noise amplification and identify fundamental stochastic performance tradeoffs. Our approach utilizes the Jury stability criterion to provide a novel geometric characterization of conditions for linear convergence, and it reveals the relation between the noise amplification and convergence rate as well as their dependence on the condition number and the constant algorithmic parameters. This geometric insight leads to simple alternative proofs of standard convergence results and allows us to establish ``uncertainty principle'' of strongly convex optimization: for the two-step momentum method with linear convergence rate, the lower bound on the product between the settling time and noise amplification scales quadratically with the condition number. Our analysis also identifies a key difference between the gradient and iterate noise models: while the amplification of gradient noise can be made arbitrarily small by sufficiently decelerating the algorithm, the best achievable variance for the iterate noise model increases linearly with the settling time in the decelerating regime. Finally, we introduce two parameterized families of algorithms that strike a balance between noise amplification and settling time while preserving order-wise Pareto optimality for both noise models.
- [1352] arXiv:2210.14484 (replaced) [pdf, html, other]
-
Title: Imputation of missing values in multi-view dataComments: 49 pages, 15 figures. Accepted manuscriptJournal-ref: Information Fusion 111 (2024) 102524Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
Data for which a set of objects is described by multiple distinct feature sets (called views) is known as multi-view data. When missing values occur in multi-view data, all features in a view are likely to be missing simultaneously. This may lead to very large quantities of missing data which, especially when combined with high-dimensionality, can make the application of conditional imputation methods computationally infeasible. However, the multi-view structure could be leveraged to reduce the complexity and computational load of imputation. We introduce a new imputation method based on the existing stacked penalized logistic regression (StaPLR) algorithm for multi-view learning. It performs imputation in a dimension-reduced space to address computational challenges inherent to the multi-view context. We compare the performance of the new imputation method with several existing imputation algorithms in simulated data sets and a real data application. The results show that the new imputation method leads to competitive results at a much lower computational cost, and makes the use of advanced imputation algorithms such as missForest and predictive mean matching possible in settings where they would otherwise be computationally infeasible.
- [1353] arXiv:2211.15498 (replaced) [pdf, html, other]
-
Title: Physics-informed Neural Networks with Unknown Measurement NoiseSubjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Physics-informed neural networks (PINNs) constitute a flexible approach to both finding solutions and identifying parameters of partial differential equations. Most works on the topic assume noiseless data, or data contaminated with weak Gaussian noise. We show that the standard PINN framework breaks down in case of non-Gaussian noise. We give a way of resolving this fundamental issue and we propose to jointly train an energy-based model (EBM) to learn the correct noise distribution. We illustrate the improved performance of our approach using multiple examples.
- [1354] arXiv:2212.14041 (replaced) [pdf, html, other]
-
Title: Deciphering RNA Secondary Structure Prediction: A Probabilistic K-Rook Matching PerspectiveCheng Tan, Zhangyang Gao, Hanqun Cao, Xingran Chen, Ge Wang, Lirong Wu, Jun Xia, Jiangbin Zheng, Stan Z. LiComments: Accepted by ICML 2024Subjects: Biomolecules (q-bio.BM); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
The secondary structure of ribonucleic acid (RNA) is more stable and accessible in the cell than its tertiary structure, making it essential for functional prediction. Although deep learning has shown promising results in this field, current methods suffer from poor generalization and high complexity. In this work, we reformulate the RNA secondary structure prediction as a K-Rook problem, thereby simplifying the prediction process into probabilistic matching within a finite solution space. Building on this innovative perspective, we introduce RFold, a simple yet effective method that learns to predict the most matching K-Rook solution from the given sequence. RFold employs a bi-dimensional optimization strategy that decomposes the probabilistic matching problem into row-wise and column-wise components to reduce the matching complexity, simplifying the solving process while guaranteeing the validity of the output. Extensive experiments demonstrate that RFold achieves competitive performance and about eight times faster inference efficiency than the state-of-the-art approaches. The code and Colab demo are available in (this http URL).
- [1355] arXiv:2301.05551 (replaced) [pdf, html, other]
-
Title: Dynamic Basis Function Interpolation for Adaptive In Situ Data Integration in Ocean ModelingSubjects: Atmospheric and Oceanic Physics (physics.ao-ph); Machine Learning (cs.LG); Dynamical Systems (math.DS)
We propose a new method for combining in situ buoy measurements with Earth system models (ESMs) to improve the accuracy of temperature predictions in the ocean. The technique utilizes the dynamics \textit{and} modes identified in ESMs alongside buoy measurements to improve accuracy while preserving features such as seasonality. We use this technique, which we call Dynamic Basis Function Interpolation, to correct errors in localized temperature predictions made by the Model for Prediction Across Scales Ocean component (MPAS-O) with the Global Drifter Program's in situ ocean buoy dataset.
- [1356] arXiv:2302.08176 (replaced) [pdf, html, other]
-
Title: The logic behind desirable sets of things, and its filter representationComments: 38 pages, 1 figure, 2 tablesSubjects: Logic (math.LO); Artificial Intelligence (cs.AI)
We identify the (filter representation of the) logic behind the recent theory of coherent sets of desirable (sets of) things, which generalise coherent sets of desirable (sets of) gambles as well as coherent choice functions, and show that this identification allows us to establish various representation results for such coherent models in terms of simpler ones.
- [1357] arXiv:2302.13186 (replaced) [pdf, html, other]
-
Title: Construction numbers: How to build a graph?Comments: 19 pages; Problem 12401 - 06 in the June 2023 issue of the American Math Monthly was to find the construction number for $K_n$. Solution is given hereSubjects: Combinatorics (math.CO); Artificial Intelligence (cs.AI)
Counting the number of linear extensions of a partial order was considered by Stanley about 50 years ago. For the partial order on the vertices and edges of a graph given by inclusion, we call a linear extension a {\it construction sequence} for the graph as each edge follows the vertices where it is attached. The number of these c-sequences is counted for various graph families. We also consider the set of all length-$n$ c-sequences produced by the graphs with $n$ elements, simplified to their structural skeleton: vertex or edge, and further allow the generating graph to be structurally constrained. Efficiency is analyzed.
- [1358] arXiv:2306.15728 (replaced) [pdf, html, other]
-
Title: Physics-inspired spatiotemporal-graph AI ensemble for the detection of higher order wave mode signals of spinning binary black hole mergersComments: 14 pages, 6 figures, and 3 tablesJournal-ref: Mach. Learn.: Sci. Technol. 5 025056 (2024) Mach. Learn.: Sci. Technol. 5 025056 Mach. Learn.: Sci. Technol. 5 025056Subjects: Instrumentation and Methods for Astrophysics (astro-ph.IM); Artificial Intelligence (cs.AI); General Relativity and Quantum Cosmology (gr-qc)
We present a new class of AI models for the detection of quasi-circular, spinning, non-precessing binary black hole mergers whose waveforms include the higher order gravitational wave modes $(l, |m|)=\{(2, 2), (2, 1), (3, 3), (3, 2), (4, 4)\}$, and mode mixing effects in the $l = 3, |m| = 2$ harmonics. These AI models combine hybrid dilated convolution neural networks to accurately model both short- and long-range temporal sequential information of gravitational waves; and graph neural networks to capture spatial correlations among gravitational wave observatories to consistently describe and identify the presence of a signal in a three detector network encompassing the Advanced LIGO and Virgo detectors. We first trained these spatiotemporal-graph AI models using synthetic noise, using 1.2 million modeled waveforms to densely sample this signal manifold, within 1.7 hours using 256 A100 GPUs in the Polaris supercomputer at the ALCF. Our distributed training approach had optimal performance, and strong scaling up to 512 A100 GPUs. With these AI ensembles we processed data from a three detector network, and found that an ensemble of 4 AI models achieves state-of-the-art performance for signal detection, and reports two misclassifications for every decade of searched data. We distributed AI inference over 128 GPUs in the Polaris supercomputer and 128 nodes in the Theta supercomputer, and completed the processing of a decade of gravitational wave data from a three detector network within 3.5 hours. Finally, we fine-tuned these AI ensembles to process the entire month of February 2020, which is part of the O3b LIGO/Virgo observation run, and found 6 gravitational waves, concurrently identified in Advanced LIGO and Advanced Virgo data, and zero false positives. This analysis was completed in one hour using one A100 GPU.
- [1359] arXiv:2307.05125 (replaced) [pdf, html, other]
-
Title: Linearizing Binary Optimization Problems Using Variable Posets for Ising MachinesComments: 22 pages. This article has been accepted for publication in IEEE Transactions on Emerging Topics in ComputingSubjects: Optimization and Control (math.OC); Statistical Mechanics (cond-mat.stat-mech); Emerging Technologies (cs.ET); Applied Physics (physics.app-ph)
Ising machines are next-generation computers expected to efficiently sample near-optimal solutions of combinatorial optimization problems. Combinatorial optimization problems are modeled as quadratic unconstrained binary optimization (QUBO) problems to apply an Ising machine. However, current state-of-the-art Ising machines still often fail to output near-optimal solutions due to the complicated energy landscape of QUBO problems. Furthermore, the physical implementation of Ising machines severely restricts the size of QUBO problems to be input as a result of limited hardware graph structures. In this study, we take a new approach to these challenges by injecting auxiliary penalties preserving the optimum, which reduces quadratic terms in QUBO objective functions. The process simultaneously simplifies the energy landscape of QUBO problems, allowing the search for near-optimal solutions, and makes QUBO problems sparser, facilitating encoding into Ising machines with restriction on the hardware graph structure. We propose linearization of QUBO problems using variable posets as an outcome of the approach. By applying the proposed method to synthetic QUBO instances and to multi-dimensional knapsack problems, we empirically validate the effects on enhancing minor-embedding of QUBO problems and the performance of Ising machines.
- [1360] arXiv:2307.07357 (replaced) [pdf, html, other]
-
Title: Inverse Optimization for Routing ProblemsSubjects: Optimization and Control (math.OC); Machine Learning (cs.LG)
We propose a method for learning decision-makers' behavior in routing problems using Inverse Optimization (IO). The IO framework falls into the supervised learning category and builds on the premise that the target behavior is an optimizer of an unknown cost function. This cost function is to be learned through historical data, and in the context of routing problems, can be interpreted as the routing preferences of the decision-makers. In this view, the main contributions of this study are to propose an IO methodology with a hypothesis function, loss function, and stochastic first-order algorithm tailored to routing problems. We further test our IO approach in the Amazon Last Mile Routing Research Challenge, where the goal is to learn models that replicate the routing preferences of human drivers, using thousands of real-world routing examples. Our final IO-learned routing model achieves a score that ranks 2nd compared with the 48 models that qualified for the final round of the challenge. Our examples and results showcase the flexibility and real-world potential of the proposed IO methodology to learn from decision-makers' decisions in routing problems.
- [1361] arXiv:2307.12360 (replaced) [pdf, html, other]
-
Title: Unravelling the Mechanics of Knitted Fabrics Through Hierarchical Geometric RepresentationSubjects: Soft Condensed Matter (cond-mat.soft); Numerical Analysis (math.NA)
Knitting interloops one-dimensional yarns into three-dimensional fabrics that exhibit behaviours beyond their constitutive materials. How extensibility and anisotropy emerge from the hierarchical organisation of yarns into knitted fabrics has long been unresolved. We sought to unravel the mechanical roles of tensile mechanics, assembly and dynamics arising from the yarn level on fabric nonlinearity by developing a yarn-based dynamical model. This physically validated model captures the fundamental mechanical response of knitted fabrics, analogous to flexible metamaterials and biological fiber networks due to geometric nonlinearity within such hierarchical systems. Fabric anisotropy originates from observed yarn-yarn rearrangements during alignment dynamics and is topology-dependent. This yarn-based model also provides a design space of knitted fabrics to embed functionalities by varying geometric configuration and material property in instructed procedures compatible to machine manufacturing. Our hierarchical approach to build up a knitted fabrics computationally modernizes an ancient craft and represents a first step towards mechanical programmability of knitted fabrics in wide engineering applications.
- [1362] arXiv:2307.13520 (replaced) [pdf, html, other]
-
Title: Using power system modelling outputs to identify weather-induced extreme events in highly renewable systemsJournal-ref: Environmental Research Letters, Volume 19, Number 5 (2024)Subjects: Physics and Society (physics.soc-ph); Systems and Control (eess.SY)
In highly renewable power systems the increased weather dependence can result in new resilience challenges, such as renewable energy droughts, or a lack of sufficient renewable generation at times of high demand. The weather conditions responsible for these challenges have been well-studied in the literature. However, in reality multi-day resilience challenges are triggered by complex interactions between high demand, low renewable availability, electricity transmission constraints and storage dynamics. We show these challenges cannot be rigorously understood from an exclusively power systems, or meteorological, perspective. We propose a new method that uses electricity shadow prices - obtained by a European power system model based on 40 years of reanalysis data - to identify the most difficult periods driving system investments. Such difficult periods are driven by large-scale weather conditions such as low wind and cold temperature periods of various lengths associated with stationary high pressure over Europe. However, purely meteorological approaches fail to identify which events lead to the largest system stress over the multi-decadal study period due to the influence of subtle transmission bottlenecks and storage issues across multiple regions. These extreme events also do not relate strongly to traditional weather patterns (such as Euro-Atlantic weather regimes or the North Atlantic Oscillation index). We therefore compile a new set of weather patterns to define energy system stress events which include the impacts of electricity storage and large-scale interconnection. Without interdisciplinary studies combining state-of-the-art energy meteorology and modelling, further strive for adequate renewable power systems will be hampered.
- [1363] arXiv:2307.14441 (replaced) [pdf, html, other]
-
Title: Dense outputs from quantum simulationsComments: 30 pages. Accepted by Journal of Computational PhysicsSubjects: Quantum Physics (quant-ph); Numerical Analysis (math.NA)
The quantum dense output problem is the process of evaluating time-accumulated observables from time-dependent quantum dynamics using quantum computers. This problem arises frequently in applications such as quantum control and spectroscopic computation. We present a range of algorithms designed to operate on both early and fully fault-tolerant quantum platforms. These methodologies draw upon techniques like amplitude estimation, Hamiltonian simulation, quantum linear Ordinary Differential Equation (ODE) solvers, and quantum Carleman linearization. We provide a comprehensive complexity analysis with respect to the evolution time $T$ and error tolerance $\epsilon$. Our results demonstrate that the linearization approach can nearly achieve optimal complexity $\mathcal{O}(T/\epsilon)$ for a certain type of low-rank dense outputs. Moreover, we provide a linearization of the dense output problem that yields an exact and finite-dimensional closure which encompasses the original states. This formulation is related to the Koopman Invariant Subspace theory and may be of independent interest in nonlinear control and scientific machine learning.
- [1364] arXiv:2308.16681 (replaced) [pdf, html, other]
-
Title: One Model Many Scores: Using Multiverse Analysis to Prevent Fairness Hacking and Evaluate the Influence of Model Design DecisionsJournal-ref: FAccT '24: Proceedings of the 2024 ACM Conference on Fairness, Accountability, and Transparency (2024) 1305-1320Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
A vast number of systems across the world use algorithmic decision making (ADM) to (partially) automate decisions that have previously been made by humans. The downstream effects of ADM systems critically depend on the decisions made during a systems' design, implementation, and evaluation, as biases in data can be mitigated or reinforced along the modeling pipeline. Many of these decisions are made implicitly, without knowing exactly how they will influence the final system. To study this issue, we draw on insights from the field of psychology and introduce the method of multiverse analysis for algorithmic fairness. In our proposed method, we turn implicit decisions during design and evaluation into explicit ones and demonstrate their fairness implications. By combining decisions, we create a grid of all possible "universes" of decision combinations. For each of these universes, we compute metrics of fairness and performance. Using the resulting dataset, one can investigate the variability and robustness of fairness scores and see how and which decisions impact fairness. We demonstrate how multiverse analyses can be used to better understand fairness implications of design and evaluation decisions using an exemplary case study of predicting public health care coverage for vulnerable populations. Our results highlight how decisions regarding the evaluation of a system can lead to vastly different fairness metrics for the same model. This is problematic, as a nefarious actor could optimise or "hack" a fairness metric to portray a discriminating model as fair merely by changing how it is evaluated. We illustrate how a multiverse analysis can help to address this issue.
- [1365] arXiv:2309.14780 (replaced) [pdf, other]
-
Title: Transferring climate change knowledgeFrancesco Immorlano, Veronika Eyring, Thomas le Monnier de Gouville, Gabriele Accarino, Donatello Elia, Giovanni Aloisio, Pierre GentineSubjects: Atmospheric and Oceanic Physics (physics.ao-ph); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Accurate and precise climate projections are required for climate adaptation and mitigation, but Earth system models still exhibit great uncertainties. Several approaches have been developed to reduce the spread of climate projections and feedbacks, yet those methods cannot capture the non-linear complexity inherent in the climate system. Using a Transfer Learning approach, we show that Machine Learning can be used to optimally leverage and merge the knowledge gained from Earth system models simulations and historical observations to more accurately project global surface air temperature fields in the 21st century. We reach an uncertainty reduction of more than 50% with respect to state-of-the-art approaches. We give evidence that our novel method provides narrower projection uncertainty together with more accurate mean climate projections, urgently required for climate adaptation.
- [1366] arXiv:2309.15001 (replaced) [pdf, html, other]
-
Title: Convergence guarantees for forward gradient descent in the linear regression modelComments: 17 pagesJournal-ref: Journal of Statistical Planning and Inference, Volume 233, 106174, 2024Subjects: Statistics Theory (math.ST); Neural and Evolutionary Computing (cs.NE)
Renewed interest in the relationship between artificial and biological neural networks motivates the study of gradient-free methods. Considering the linear regression model with random design, we theoretically analyze in this work the biologically motivated (weight-perturbed) forward gradient scheme that is based on random linear combination of the gradient. If d denotes the number of parameters and k the number of samples, we prove that the mean squared error of this method converges for $k\gtrsim d^2\log(d)$ with rate $d^2\log(d)/k.$ Compared to the dimension dependence d for stochastic gradient descent, an additional factor $d\log(d)$ occurs.
- [1367] arXiv:2310.03559 (replaced) [pdf, html, other]
-
Title: MedSyn: Text-guided Anatomy-aware Synthesis of High-Fidelity 3D CT ImagesSubjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
This paper introduces an innovative methodology for producing high-quality 3D lung CT images guided by textual information. While diffusion-based generative models are increasingly used in medical imaging, current state-of-the-art approaches are limited to low-resolution outputs and underutilize radiology reports' abundant information. The radiology reports can enhance the generation process by providing additional guidance and offering fine-grained control over the synthesis of images. Nevertheless, expanding text-guided generation to high-resolution 3D images poses significant memory and anatomical detail-preserving challenges. Addressing the memory issue, we introduce a hierarchical scheme that uses a modified UNet architecture. We start by synthesizing low-resolution images conditioned on the text, serving as a foundation for subsequent generators for complete volumetric data. To ensure the anatomical plausibility of the generated samples, we provide further guidance by generating vascular, airway, and lobular segmentation masks in conjunction with the CT images. The model demonstrates the capability to use textual input and segmentation tasks to generate synthesized images. The results of comparative assessments indicate that our approach exhibits superior performance compared to the most advanced models based on GAN and diffusion techniques, especially in accurately retaining crucial anatomical features such as fissure lines, airways, and vascular structures. This innovation introduces novel possibilities. This study focuses on two main objectives: (1) the development of a method for creating images based on textual prompts and anatomical components, and (2) the capability to generate new images conditioning on anatomical elements. The advancements in image generation can be applied to enhance numerous downstream tasks.
- [1368] arXiv:2310.09766 (replaced) [pdf, html, other]
-
Title: Pseudo-Bayesian OptimizationSubjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Bayesian Optimization is a popular approach for optimizing expensive black-box functions. Its key idea is to use a surrogate model to approximate the objective and, importantly, quantify the associated uncertainty that allows a sequential search of query points that balance exploitation-exploration. Gaussian process (GP) has been a primary candidate for the surrogate model, thanks to its Bayesian-principled uncertainty quantification power and modeling flexibility. However, its challenges have also spurred an array of alternatives whose convergence properties could be more opaque. Motivated by these, we study in this paper an axiomatic framework that elicits the minimal requirements to guarantee black-box optimization convergence that could apply beyond GP-based methods. Moreover, we leverage the design freedom in our framework, which we call Pseudo-Bayesian Optimization, to construct empirically superior algorithms. In particular, we show how using simple local regression, and a suitable "randomized prior" construction to quantify uncertainty, not only guarantees convergence but also consistently outperforms state-of-the-art benchmarks in examples ranging from high-dimensional synthetic experiments to realistic hyperparameter tuning and robotic applications.
- [1369] arXiv:2310.17467 (replaced) [pdf, html, other]
-
Title: The statistical thermodynamics of generative diffusion models: Phase transitions, symmetry breaking and critical instabilitySubjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Generative diffusion models have achieved spectacular performance in many areas of machine learning and generative modeling. While the fundamental ideas behind these models come from non-equilibrium physics, variational inference and stochastic calculus, in this paper we show that many aspects of these models can be understood using the tools of equilibrium statistical mechanics. Using this reformulation, we show that generative diffusion models undergo second-order phase transitions corresponding to symmetry breaking phenomena. We show that these phase-transitions are always in a mean-field universality class, as they are the result of a self-consistency condition in the generative dynamics. We argue that the critical instability that arises from the phase transitions lies at the heart of their generative capabilities, which are characterized by a set of mean-field critical exponents. Finally, we show that the dynamic equation of the generative process can be interpreted as a stochastic adiabatic transformation that minimizes the free energy while keeping the system in thermal equilibrium.
- [1370] arXiv:2311.00483 (replaced) [pdf, html, other]
-
Title: DEFN: Dual-Encoder Fourier Group Harmonics Network for Three-Dimensional Indistinct-Boundary Object SegmentationXiaohua Jiang, Yihao Guo, Jian Huang, Yuting Wu, Meiyi Luo, Zhaoyang Xu, Qianni Zhang, Xingru Huang, Hong He, Shaowei Jiang, Jing Ye, Mang XiaoComments: 36pages,16figures,7tablesSubjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
The precise spatial and quantitative delineation of indistinct-boundary medical objects is paramount for the accuracy of diagnostic protocols, efficacy of surgical interventions, and reliability of postoperative assessments. Despite their significance, the effective segmentation and instantaneous three-dimensional reconstruction are significantly impeded by the paucity of representative samples in available datasets and noise artifacts. To surmount these challenges, we introduced Stochastic Defect Injection (SDi) to augment the representational diversity of challenging indistinct-boundary objects within training corpora. Consequently, we propose the Dual-Encoder Fourier Group Harmonics Network (DEFN) to tailor noise filtration, amplify detailed feature recognition, and bolster representation across diverse medical imaging scenarios. By incorporating Dynamic Weight Composing (DWC) loss dynamically adjusts model's focus based on training progression, DEFN achieves SOTA performance on the OIMHS public dataset, showcasing effectiveness in indistinct boundary contexts. Source code for DEFN is available at: this https URL.
- [1371] arXiv:2311.05529 (replaced) [pdf, html, other]
-
Title: Information-theoretic generalization bounds for learning from quantum dataComments: 48+14 pages, 4 figuresSubjects: Quantum Physics (quant-ph); Computational Complexity (cs.CC); Information Theory (cs.IT); Machine Learning (cs.LG)
Learning tasks play an increasingly prominent role in quantum information and computation. They range from fundamental problems such as state discrimination and metrology over the framework of quantum probably approximately correct (PAC) learning, to the recently proposed shadow variants of state tomography. However, the many directions of quantum learning theory have so far evolved separately. We propose a general mathematical formalism for describing quantum learning by training on classical-quantum data and then testing how well the learned hypothesis generalizes to new data. In this framework, we prove bounds on the expected generalization error of a quantum learner in terms of classical and quantum information-theoretic quantities measuring how strongly the learner's hypothesis depends on the specific data seen during training. To achieve this, we use tools from quantum optimal transport and quantum concentration inequalities to establish non-commutative versions of decoupling lemmas that underlie recent information-theoretic generalization bounds for classical machine learning. Our framework encompasses and gives intuitively accessible generalization bounds for a variety of quantum learning scenarios such as quantum state discrimination, PAC learning quantum states, quantum parameter estimation, and quantumly PAC learning classical functions. Thereby, our work lays a foundation for a unifying quantum information-theoretic perspective on quantum learning.
- [1372] arXiv:2311.06253 (replaced) [pdf, html, other]
-
Title: Advancing Parsimonious Deep Learning Weather Prediction using the HEALPix MeshMatthias Karlbauer, Nathaniel Cresswell-Clay, Dale R. Durran, Raul A. Moreno, Thorsten Kurth, Boris Bonev, Noah Brenowitz, Martin V. ButzComments: Submitted to Journal of Advances in Modeling Earth Systems (JAMES)Subjects: Atmospheric and Oceanic Physics (physics.ao-ph); Machine Learning (cs.LG)
We present a parsimonious deep learning weather prediction model to forecast seven atmospheric variables with 3-h time resolution for up to one-year lead times on a 110-km global mesh using the Hierarchical Equal Area isoLatitude Pixelization (HEALPix). In comparison to state-of-the-art (SOTA) machine learning (ML) weather forecast models, such as Pangu-Weather and GraphCast, our DLWP-HPX model uses coarser resolution and far fewer prognostic variables. Yet, at one-week lead times, its skill is only about one day behind both SOTA ML forecast models and the SOTA numerical weather prediction model from the European Centre for Medium-Range Weather Forecasts. We report several improvements in model design, including switching from the cubed sphere to the HEALPix mesh, inverting the channel depth of the U-Net, and introducing gated recurrent units (GRU) on each level of the U-Net hierarchy. The consistent east-west orientation of all cells on the HEALPix mesh facilitates the development of location-invariant convolution kernels that successfully propagate weather patterns across the globe without requiring separate kernels for the polar and equatorial faces of the cube sphere. Without any loss of spectral power after the first two days, the model can be unrolled autoregressively for hundreds of steps into the future to generate realistic states of the atmosphere that respect seasonal trends, as showcased in one-year simulations.
- [1373] arXiv:2311.08636 (replaced) [pdf, html, other]
-
Title: Supervised low-rank semi-nonnegative matrix factorization with frequency regularization for forecasting spatio-temporal dataComments: 35 pages, Final versionJournal-ref: Journal of Scientific Computing (2024)Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
We propose a novel methodology for forecasting spatio-temporal data using supervised semi-nonnegative matrix factorization (SSNMF) with frequency regularization. Matrix factorization is employed to decompose spatio-temporal data into spatial and temporal components. To improve clarity in the temporal patterns, we introduce a nonnegativity constraint on the time domain along with regularization in the frequency domain. Specifically, regularization in the frequency domain involves selecting features in the frequency space, making an interpretation in the frequency domain more convenient. We propose two methods in the frequency domain: soft and hard regularizations, and provide convergence guarantees to first-order stationary points of the corresponding constrained optimization problem. While our primary motivation stems from geophysical data analysis based on GRACE (Gravity Recovery and Climate Experiment) data, our methodology has the potential for wider application. Consequently, when applying our methodology to GRACE data, we find that the results with the proposed methodology are comparable to previous research in the field of geophysical sciences but offer clearer interpretability.
- [1374] arXiv:2311.10224 (replaced) [pdf, html, other]
-
Title: CV-Attention UNet: Attention-based UNet for 3D Cerebrovascular Segmentation of Enhanced TOF-MRA ImagesSubjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Due to the lack of automated methods, to diagnose cerebrovascular disease, time-of-flight magnetic resonance angiography (TOF-MRA) is assessed visually, making it time-consuming. The commonly used encoder-decoder architectures for cerebrovascular segmentation utilize redundant features, eventually leading to the extraction of low-level features multiple times. Additionally, convolutional neural networks (CNNs) suffer from performance degradation when the batch size is small, and deeper networks experience the vanishing gradient problem. Methods: In this paper, we attempt to solve these limitations and propose the 3D cerebrovascular attention UNet method, named CV-AttentionUNet, for precise extraction of brain vessel images. We proposed a sequence of preprocessing techniques followed by deeply supervised UNet to improve the accuracy of segmentation of the brain vessels leading to a stroke. To combine the low and high semantics, we applied the attention mechanism. This mechanism focuses on relevant associations and neglects irrelevant anatomical information. Furthermore, the inclusion of deep supervision incorporates different levels of features that prove to be beneficial for network convergence. Results: We demonstrate the efficiency of the proposed method by cross-validating with an unlabeled dataset, which was further labeled by us. We believe that the novelty of this algorithm lies in its ability to perform well on both labeled and unlabeled data with image processing-based enhancement. The results indicate that our method performed better than the existing state-of-the-art methods on the TubeTK dataset. Conclusion: The proposed method will help in accurate segmentation of cerebrovascular structure leading to stroke
- [1375] arXiv:2311.10433 (replaced) [pdf, html, other]
-
Title: Task Scheduling Optimization from a Tensor Network PerspectiveComments: 8 pages, 4 figuresSubjects: Quantum Physics (quant-ph); Emerging Technologies (cs.ET)
We present a novel method for task optimization in industrial plants using quantum-inspired tensor network technology. This method allows us to obtain the best possible combination of tasks on a set of machines with a set of constraints without having to evaluate all possible combinations. We simulate a quantum system with all possible combinations, perform an imaginary time evolution and a series of projections to satisfy the constraints. We improve its scalability by means of a compression method, an iterative algorithm, and a genetic algorithm, and show the results obtained on simulated cases.
- [1376] arXiv:2311.11900 (replaced) [pdf, html, other]
-
Title: Measuring and Mitigating Biases in Motor Insurance PricingSubjects: Machine Learning (stat.ML); Computers and Society (cs.CY); Machine Learning (cs.LG)
The non-life insurance sector operates within a highly competitive and tightly regulated framework, confronting a pivotal juncture in the formulation of pricing strategies. Insurers are compelled to harness a range of statistical methodologies and available data to construct optimal pricing structures that align with the overarching corporate strategy while accommodating the dynamics of market competition. Given the fundamental societal role played by insurance, premium rates are subject to rigorous scrutiny by regulatory authorities. These rates must conform to principles of transparency, explainability, and ethical considerations. Consequently, the act of pricing transcends mere statistical calculations and carries the weight of strategic and societal factors. These multifaceted concerns may drive insurers to establish equitable premiums, taking into account various variables. For instance, regulations mandate the provision of equitable premiums, considering factors such as policyholder gender or mutualist group dynamics in accordance with respective corporate strategies. Age-based premium fairness is also mandated. In certain insurance domains, variables such as the presence of serious illnesses or disabilities are emerging as new dimensions for evaluating fairness. Regardless of the motivating factor prompting an insurer to adopt fairer pricing strategies for a specific variable, the insurer must possess the capability to define, measure, and ultimately mitigate any ethical biases inherent in its pricing practices while upholding standards of consistency and performance. This study seeks to provide a comprehensive set of tools for these endeavors and assess their effectiveness through practical application in the context of automobile insurance.
- [1377] arXiv:2311.13564 (replaced) [pdf, html, other]
-
Title: High order universal portfoliosSubjects: Portfolio Management (q-fin.PM); Numerical Analysis (math.NA)
The Cover universal portfolio (UP from now on) has many interesting theoretical and numerical properties and was investigated for a long time. Building on it, we explore what happens when we add this UP to the market as a new synthetic asset and construct by recurrence higher order UPs. We investigate some important theoretical properties of the high order UPs and show in particular that they are indeed different from the Cover UP and are capable to break the time permutation invariance. We show that under some perturbation regime the second high order UP has better Sharp ratio than the standard UP and briefly investigate arbitrage opportunities thus created. Numerical experiences on a benchmark from the literature confirm that high order UPs improve Cover's UP performances.
- [1378] arXiv:2312.00128 (replaced) [pdf, html, other]
-
Title: Low latency optical-based mode tracking with machine learning deployed on FPGAs on a tokamakYumou Wei, Ryan F. Forelli, Chris Hansen, Jeffrey P. Levesque, Nhan Tran, Joshua C. Agar, Giuseppe Di Guglielmo, Michael E. Mauel, Gerald A. NavratilComments: The following article has been submitted to/accepted by Review of Scientific Instruments. After it is published, it will be found at this https URLSubjects: Plasma Physics (physics.plasm-ph); Hardware Architecture (cs.AR); Machine Learning (cs.LG); Instrumentation and Detectors (physics.ins-det)
Active feedback control in magnetic confinement fusion devices is desirable to mitigate plasma instabilities and enable robust operation. Optical high-speed cameras provide a powerful, non-invasive diagnostic and can be suitable for these applications. In this study, we process fast camera data, at rates exceeding 100kfps, on $\textit{in situ}$ Field Programmable Gate Array (FPGA) hardware to track magnetohydrodynamic (MHD) mode evolution and generate control signals in real-time. Our system utilizes a convolutional neural network (CNN) model which predicts the $n$=1 MHD mode amplitude and phase using camera images with better accuracy than other tested non-deep-learning-based methods. By implementing this model directly within the standard FPGA readout hardware of the high-speed camera diagnostic, our mode tracking system achieves a total trigger-to-output latency of 17.6$\mu$s and a throughput of up to 120kfps. This study at the High Beta Tokamak-Extended Pulse (HBT-EP) experiment demonstrates an FPGA-based high-speed camera data acquisition and processing system, enabling application in real-time machine-learning-based tokamak diagnostic and control as well as potential applications in other scientific domains.
- [1379] arXiv:2312.02277 (replaced) [pdf, other]
-
Title: ALEXR: An Optimal Single-Loop Algorithm for Convex Finite-Sum Coupled Compositional Stochastic OptimizationComments: Fixed several typosSubjects: Optimization and Control (math.OC); Machine Learning (cs.LG)
This paper revisits a class of convex Finite-Sum Coupled Compositional Stochastic Optimization (cFCCO) problems with many applications, including group distributionally robust optimization (GDRO), learning with imbalanced data, reinforcement learning, and learning to rank. To better solve these problems, we introduce an efficient single-loop primal-dual block-coordinate proximal algorithm, dubbed ALEXR. This algorithm leverages block-coordinate stochastic mirror ascent updates for the dual variable and stochastic proximal gradient descent updates for the primal variable. We establish the convergence rates of ALEXR in both convex and strongly convex cases under smoothness and non-smoothness conditions of involved functions, which not only improve the best rates in previous works on smooth cFCCO problems but also expand the realm of cFCCO for solving more challenging non-smooth problems such as the dual form of GDRO. Finally, we present lower complexity bounds to demonstrate that the convergence rates of ALEXR are optimal among first-order block-coordinate stochastic algorithms for the considered class of cFCCO problems.
- [1380] arXiv:2312.04022 (replaced) [pdf, html, other]
-
Title: Analysis of Coding Gain Due to In-Loop ReshapingComments: Published in IEEE Transactions on Image ProcessingSubjects: Image and Video Processing (eess.IV); Information Theory (cs.IT)
Reshaping, a point operation that alters the characteristics of signals, has been shown capable of improving the compression ratio in video coding practices. Out-of-loop reshaping that directly modifies the input video signal was first adopted as the supplemental enhancement information (SEI) for the HEVC/H.265 without the need to alter the core design of the video codec. VVC/H.266 further improves the coding efficiency by adopting in-loop reshaping that modifies the residual signal being processed in the hybrid coding loop. In this paper, we theoretically analyze the rate-distortion performance of the in-loop reshaping and use experiments to verify the theoretical result. We prove that the in-loop reshaping can improve coding efficiency when the entropy coder adopted in the coding pipeline is suboptimal, which is in line with the practical scenarios that video codecs operate in. We derive the PSNR gain in a closed form and show that the theoretically predicted gain is consistent with that measured from experiments using standard testing video sequences.
- [1381] arXiv:2401.06422 (replaced) [pdf, html, other]
-
Title: Joint Mechanical and Electrical Adjustment of IRS-aided LEO Satellite MIMO CommunicationsComments: 5 pages, 6 figuresSubjects: Signal Processing (eess.SP); Systems and Control (eess.SY)
In this letter, we propose a joint mechanical and electrical adjustment of intelligent reflecting surface (IRS) for the performance improvements of low-earth orbit (LEO) satellite multiple-input multiple-output (MIMO) communications. In particular, we construct a three-dimensional (3D) MIMO channel model for the mechanically-tilted IRS in general deployment, and consider two types of scenarios with and without the direct path of LEO-ground user link due to the orbital flight. With the aim of maximizing the end-to-end performance, we jointly optimize tilting angle and phase shift of IRS along with the transceiver beamforming, whose performance superiority is verified via simulations with the Orbcomm LEO satellite using a real orbit data.
- [1382] arXiv:2401.08627 (replaced) [pdf, html, other]
-
Title: Predicting and Interpreting Energy Barriers of Metallic Glasses with Graph Neural NetworksComments: Accepted at ICML 2024Subjects: Disordered Systems and Neural Networks (cond-mat.dis-nn); Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG)
Metallic Glasses (MGs) are widely used materials that are stronger than steel while being shapeable as plastic. While understanding the structure-property relationship of MGs remains a challenge in materials science, studying their energy barriers (EBs) as an intermediary step shows promise. In this work, we utilize Graph Neural Networks (GNNs) to model MGs and study EBs. We contribute a new dataset for EB prediction and a novel Symmetrized GNN (SymGNN) model that is E(3)-invariant in expectation. SymGNN handles invariance by aggregating over orthogonal transformations of the graph structure. When applied to EB prediction, SymGNN are more accurate than molecular dynamics (MD) local-sampling methods and other machine-learning models. Compared to precise MD simulations, SymGNN reduces the inference time on new MGs from roughly 41 days to less than one second. We apply explanation algorithms to reveal the relationship between structures and EBs. The structures that we identify through explanations match the medium-range order (MRO) hypothesis and possess unique topological properties. Our work enables effective prediction and interpretation of MG EBs, bolstering material science research.
- [1383] arXiv:2401.10347 (replaced) [pdf, html, other]
-
Title: On a Rice theorem for dynamical properties of SFTs on groupsComments: Only changes in exposition. Comments welcome!Subjects: Dynamical Systems (math.DS); Information Theory (cs.IT); Combinatorics (math.CO); Group Theory (math.GR); Logic (math.LO)
Let $G$ be a group with undecidable domino problem (such as $\mathbb{Z}^2$). We prove the undecidability of all nontrivial dynamical properties for sofic $G$-subshifts, that such a result fails for SFTs, and an undecidability result for dynamical properties of $G$-SFTs similar to the Adian-Rabin theorem. For $G$ amenable we prove that topological entropy is not computable from presentations of SFTs, and a more general result for dynamical invariants taking values in partially ordered sets.
- [1384] arXiv:2402.01542 (replaced) [pdf, html, other]
-
Title: Learning Collective Variables with Synthetic Data Augmentation through Physics-inspired Geodesic InterpolationSubjects: Chemical Physics (physics.chem-ph); Machine Learning (cs.LG); Biomolecules (q-bio.BM)
In molecular dynamics simulations, rare events, such as protein folding, are typically studied using enhanced sampling techniques, most of which are based on the definition of a collective variable (CV) along which acceleration occurs. Obtaining an expressive CV is crucial, but often hindered by the lack of information about the particular event, e.g., the transition from unfolded to folded conformation. We propose a simulation-free data augmentation strategy using physics-inspired metrics to generate geodesic interpolations resembling protein folding transitions, thereby improving sampling efficiency without true transition state samples. This new data can be used to improve the accuracy of classifier-based methods. Alternatively, a regression-based learning scheme for CV models can be adopted by leveraging the interpolation progress parameter.
- [1385] arXiv:2402.05210 (replaced) [pdf, html, other]
-
Title: Anatomically-Controllable Medical Image Generation with Segmentation-Guided Diffusion ModelsComments: Accepted at MICCAI 2024. Code and synthetic dataset: this https URLSubjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Machine Learning (stat.ML)
Diffusion models have enabled remarkably high-quality medical image generation, yet it is challenging to enforce anatomical constraints in generated images. To this end, we propose a diffusion model-based method that supports anatomically-controllable medical image generation, by following a multi-class anatomical segmentation mask at each sampling step. We additionally introduce a random mask ablation training algorithm to enable conditioning on a selected combination of anatomical constraints while allowing flexibility in other anatomical areas. We compare our method ("SegGuidedDiff") to existing methods on breast MRI and abdominal/neck-to-pelvis CT datasets with a wide range of anatomical objects. Results show that our method reaches a new state-of-the-art in the faithfulness of generated images to input anatomical masks on both datasets, and is on par for general anatomical realism. Finally, our model also enjoys the extra benefit of being able to adjust the anatomical similarity of generated images to real images of choice through interpolation in its latent space. SegGuidedDiff has many applications, including cross-modality translation, and the generation of paired or counterfactual data. Our code is available at this https URL.
- [1386] arXiv:2402.05271 (replaced) [pdf, html, other]
-
Title: Feature learning as alignment: a structural property of gradient descent in non-linear neural networksSubjects: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Understanding the mechanisms through which neural networks extract statistics from input-label pairs through feature learning is one of the most important unsolved problems in supervised learning. Prior works demonstrated that the gram matrices of the weights (the neural feature matrices, NFM) and the average gradient outer products (AGOP) become correlated during training, in a statement known as the neural feature ansatz (NFA). Through the NFA, the authors introduce mapping with the AGOP as a general mechanism for neural feature learning. However, these works do not provide a theoretical explanation for this correlation or its origins. In this work, we further clarify the nature of this correlation, and explain its emergence. We show that this correlation is equivalent to alignment between the left singular structure of the weight matrices and the newly defined pre-activation tangent features at each layer. We further establish that the alignment is driven by the interaction of weight changes induced by SGD with the pre-activation features, and analyze the resulting dynamics analytically at early times in terms of simple statistics of the inputs and labels. Finally, motivated by the observation that the NFA is driven by this centered correlation, we introduce a simple optimization rule that dramatically increases the NFA correlations at any given layer and improves the quality of features learned.
- [1387] arXiv:2402.07445 (replaced) [pdf, html, other]
-
Title: Top-$K$ ranking with a monotone adversaryComments: Accepted to Conference of Learning Theory, 2024Subjects: Machine Learning (stat.ML); Information Theory (cs.IT); Machine Learning (cs.LG); Statistics Theory (math.ST)
In this paper, we address the top-$K$ ranking problem with a monotone adversary. We consider the scenario where a comparison graph is randomly generated and the adversary is allowed to add arbitrary edges. The statistician's goal is then to accurately identify the top-$K$ preferred items based on pairwise comparisons derived from this semi-random comparison graph. The main contribution of this paper is to develop a weighted maximum likelihood estimator (MLE) that achieves near-optimal sample complexity, up to a $\log^2(n)$ factor, where $n$ denotes the number of items under comparison. This is made possible through a combination of analytical and algorithmic innovations. On the analytical front, we provide a refined~$\ell_\infty$ error analysis of the weighted MLE that is more explicit and tighter than existing analyses. It relates the~$\ell_\infty$ error with the spectral properties of the weighted comparison graph. Motivated by this, our algorithmic innovation involves the development of an SDP-based approach to reweight the semi-random graph and meet specified spectral properties. Additionally, we propose a first-order method based on the Matrix Multiplicative Weight Update (MMWU) framework. This method efficiently solves the resulting SDP in nearly-linear time relative to the size of the semi-random comparison graph.
- [1388] arXiv:2402.14742 (replaced) [pdf, html, other]
-
Title: New scattered linearized quadrinomialsSubjects: Combinatorics (math.CO); Information Theory (cs.IT)
Let $1<t<n$ be integers, where $t$ is a divisor of $n$. An R-$q^t$-partially scattered polynomial is a $\mathbb F_q$-linearized polynomial $f$ in $\mathbb F_{q^n}[X]$ that satisfies the condition that for all $x,y\in\mathbb F_{q^n}^*$ such that $x/y\in\mathbb F_{q^t}$, if $f(x)/x=f(y)/y$, then $x/y\in\mathbb F_q$; $f$ is called scattered if this implication holds for all $x,y\in\mathbb F_{q^n}^*$. Two polynomials in $\mathbb F_{q^n}[X]$ are said to be equivalent if their graphs are in the same orbit under the action of the group $\Gamma L(2,q^n)$. For $n>8$ only three families of scattered polynomials in $\mathbb F_{q^n}[X]$ are known: $(i)$~monomials of pseudoregulus type, $(ii)$~binomials of Lunardon-Polverino type, and $(iii)$~a family of quadrinomials defined in [1,10] and extended in [8,13]. In this paper we prove that the polynomial $\varphi_{m,q^J}=X^{q^{J(t-1)}}+X^{q^{J(2t-1)}}+m(X^{q^J}-X^{q^{J(t+1)}})\in\mathbb F_{q^{2t}}[X]$, $q$ odd, $t\ge3$ is R-$q^t$-partially scattered for every value of $m\in\mathbb F_{q^t}^*$ and $J$ coprime with $2t$. Moreover, for every $t>4$ and $q>5$ there exist values of $m$ for which $\varphi_{m,q}$ is scattered and new with respect to the polynomials mentioned in $(i)$, $(ii)$ and $(iii)$ above. The related linear sets are of $\Gamma L$-class at least two.
- [1389] arXiv:2403.00289 (replaced) [pdf, html, other]
-
Title: Optimization of array encoding for ultrasound imagingComments: 31 pages, 12 figuresJournal-ref: Phys. Med. Biol. 69 125024 (2024)Subjects: Medical Physics (physics.med-ph); Machine Learning (cs.LG); Optimization and Control (math.OC)
Objective: The transmit encoding model for synthetic aperture imaging is a robust and flexible framework for understanding the effects of acoustic transmission on ultrasound image reconstruction. Our objective is to use machine learning (ML) to construct scanning sequences, parameterized by time delays and apodization weights, that produce high-quality B-mode images. Approach: We use a custom ML model in PyTorch with simulated RF data from Field II to probe the space of possible encoding sequences for those that minimize a loss function that describes image quality. This approach is made computationally feasible by a novel formulation of the derivative for delay-and-sum beamforming. Main Results: When trained for a specified experimental setting (imaging domain, hardware restrictions, etc.), our ML model produces optimized encoding sequences that, when deployed in the REFoCUS imaging framework, improve a number of standard quality metrics over conventional sequences including resolution, field of view, and contrast. We demonstrate these results experimentally on both wire targets and a tissue-mimicking phantom. Significance: This work demonstrates that the set of commonly used encoding schemes represent only a narrow subset of those available. Additionally, it demonstrates the value for ML tasks in synthetic transmit aperture imaging to consider the beamformer within the model, instead of purely as a post-processing step.
- [1390] arXiv:2403.03589 (replaced) [pdf, html, other]
-
Title: Active Adaptive Experimental Design for Treatment Effect Estimation with Covariate ChoicesSubjects: Methodology (stat.ME); Machine Learning (cs.LG); Econometrics (econ.EM); Machine Learning (stat.ML)
This study designs an adaptive experiment for efficiently estimating average treatment effects (ATEs). In each round of our adaptive experiment, an experimenter sequentially samples an experimental unit, assigns a treatment, and observes the corresponding outcome immediately. At the end of the experiment, the experimenter estimates an ATE using the gathered samples. The objective is to estimate the ATE with a smaller asymptotic variance. Existing studies have designed experiments that adaptively optimize the propensity score (treatment-assignment probability). As a generalization of such an approach, we propose optimizing the covariate density as well as the propensity score. First, we derive the efficient covariate density and propensity score that minimize the semiparametric efficiency bound and find that optimizing both covariate density and propensity score minimizes the semiparametric efficiency bound more effectively than optimizing only the propensity score. Next, we design an adaptive experiment using the efficient covariate density and propensity score sequentially estimated during the experiment. Lastly, we propose an ATE estimator whose asymptotic variance aligns with the minimized semiparametric efficiency bound.
- [1391] arXiv:2404.01929 (replaced) [pdf, other]
-
Title: Towards Enhanced Analysis of Lung Cancer Lesions in EBUS-TBNA -- A Semi-Supervised Video Object Detection MethodSubjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
This study aims to establish a computer-aided diagnostic system for lung lesions using endobronchial ultrasound (EBUS) to assist physicians in identifying lesion areas. During EBUS-transbronchial needle aspiration (EBUS-TBNA) procedures, hysicians rely on grayscale ultrasound images to determine the location of lesions. However, these images often contain significant noise and can be influenced by surrounding tissues or blood vessels, making identification challenging. Previous research has lacked the application of object detection models to EBUS-TBNA, and there has been no well-defined solution for the lack of annotated data in the EBUS-TBNA dataset. In related studies on ultrasound images, although models have been successful in capturing target regions for their respective tasks, their training and predictions have been based on two-dimensional images, limiting their ability to leverage temporal features for improved predictions. This study introduces a three-dimensional video-based object detection model. It first generates a set of improved queries using a diffusion model, then captures temporal correlations through an attention mechanism. A filtering mechanism selects relevant information from previous frames to pass to the current frame. Subsequently, a teacher-student model training approach is employed to further optimize the model using unlabeled data. By incorporating various data augmentation and feature alignment, the model gains robustness against interference. Test results demonstrate that this model, which captures spatiotemporal information and employs semi-supervised learning methods, achieves an Average Precision (AP) of 48.7 on the test dataset, outperforming other models. It also achieves an Average Recall (AR) of 79.2, significantly leading over existing models.
- [1392] arXiv:2404.07970 (replaced) [pdf, html, other]
-
Title: Differentiable All-pole Filters for Time-varying Audio SystemsChin-Yun Yu, Christopher Mitcheltree, Alistair Carson, Stefan Bilbao, Joshua D. Reiss, György FazekasComments: Accepted at DAFx 2024Subjects: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG); Sound (cs.SD)
Infinite impulse response filters are an essential building block of many time-varying audio systems, such as audio effects and synthesisers. However, their recursive structure impedes end-to-end training of these systems using automatic differentiation. Although non-recursive filter approximations like frequency sampling and frame-based processing have been proposed and widely used in previous works, they cannot accurately reflect the gradient of the original system. We alleviate this difficulty by re-expressing a time-varying all-pole filter to backpropagate the gradients through itself, so the filter implementation is not bound to the technical limitations of automatic differentiation frameworks. This implementation can be employed within audio systems containing filters with poles for efficient gradient evaluation. We demonstrate its training efficiency and expressive capabilities for modelling real-world dynamic audio systems on a phaser, time-varying subtractive synthesiser, and feed-forward compressor. We make our code and audio samples available and provide the trained audio effect and synth models in a VST plugin at this https URL.
- [1393] arXiv:2404.10044 (replaced) [pdf, html, other]
-
Title: Variational quantum simulation: a case study for understanding warm startsComments: 9 + 26 pages, 5 + 2 figuresSubjects: Quantum Physics (quant-ph); Machine Learning (cs.LG); Machine Learning (stat.ML)
The barren plateau phenomenon, characterized by loss gradients that vanish exponentially with system size, poses a challenge to scaling variational quantum algorithms. Here we explore the potential of warm starts, whereby one initializes closer to a solution in the hope of enjoying larger loss variances. Focusing on an iterative variational method for learning shorter-depth circuits for quantum real and imaginary time evolution we conduct a case study to elucidate the potential and limitations of warm starts. We start by proving that the iterative variational algorithm will exhibit substantial (at worst vanishing polynomially in system size) gradients in a small region around the initializations at each time-step. Convexity guarantees for these regions are then established, suggesting trainability for polynomial size time-steps. However, our study highlights scenarios where a good minimum shifts outside the region with trainability guarantees. Our analysis leaves open the question whether such minima jumps necessitate optimization across barren plateau landscapes or whether there exist gradient flows, i.e., fertile valleys away from the plateau with substantial gradients, that allow for training.
- [1394] arXiv:2404.15405 (replaced) [pdf, html, other]
-
Title: Photometry of Saturated Stars with Neural NetworksDominik Winecki (1)Christopher S. Kochanek (2) ((1) Dept. of Computer Science and Engineeering, The Ohio State University (2) Dept. of Astronomy, The Ohio State University)Comments: accepted by ApJSubjects: Solar and Stellar Astrophysics (astro-ph.SR); Instrumentation and Methods for Astrophysics (astro-ph.IM); Computer Vision and Pattern Recognition (cs.CV)
We use a multilevel perceptron (MLP) neural network to obtain photometry of saturated stars in the All-Sky Automated Survey for Supernovae (ASAS-SN). The MLP can obtain fairly unbiased photometry for stars from g~4 to 14~mag, particularly compared to the dispersion (15%-85% 1sigma range around the median) of 0.12 mag for saturated (g<11.5 mag) stars. More importantly, the light curve of a non-variable saturated star has a median dispersion of only 0.037 mag. The MLP light curves are, in many cases, spectacularly better than those provided by the standard ASAS-SN pipelines. While the network was trained on g band data from only one of ASAS-SN's 20 cameras, initial experiments suggest that it can be used for any camera and the older ASAS-SN V band data as well. The dominant problems seem to be associated with correctable issues in the ASAS-SN data reduction pipeline for saturated stars more than the MLP itself. The method is publicly available as a light curve option on ASAS-SN Sky Patrol v1.0.
- [1395] arXiv:2405.04816 (replaced) [pdf, html, other]
-
Title: Testing the Fairness-Improvability of AlgorithmsSubjects: Econometrics (econ.EM); Data Structures and Algorithms (cs.DS); Applications (stat.AP)
Many organizations use algorithms that have a disparate impact, i.e., the benefits or harms of the algorithm fall disproportionately on certain social groups. Addressing an algorithm's disparate impact can be challenging, especially because it is often unclear whether reducing this impact is possible without sacrificing other important objectives of the organization, such as accuracy or profit. Establishing the improvability of algorithms with respect to multiple criteria is of both conceptual and practical interest: in many settings, disparate impact that would otherwise be prohibited under US federal law is permissible if it is necessary to achieve a legitimate business interest. The question is how a policy-maker can formally substantiate, or refute, this necessity defense. In this paper, we provide an econometric framework for testing the hypothesis that it is possible to improve on the fairness of an algorithm without compromising on other pre-specified objectives. Our proposed test is simple to implement and can be applied under any exogenous constraint on the algorithm space. We establish the large-sample validity and consistency of our test, and illustrate its practical application by evaluating a healthcare algorithm originally considered by Obermeyer et al 2019. In this application, we reject the null hypothesis that it is not possible to reduce the algorithm's disparate impact without compromising on the accuracy of its predictions.
- [1396] arXiv:2405.08810 (replaced) [pdf, other]
-
Title: Quantum computing with QiskitAli Javadi-Abhari, Matthew Treinish, Kevin Krsulich, Christopher J. Wood, Jake Lishman, Julien Gacon, Simon Martiel, Paul D. Nation, Lev S. Bishop, Andrew W. Cross, Blake R. Johnson, Jay M. GambettaSubjects: Quantum Physics (quant-ph); Emerging Technologies (cs.ET)
We describe Qiskit, a software development kit for quantum information science. We discuss the key design decisions that have shaped its development, and examine the software architecture and its core components. We demonstrate an end-to-end workflow for solving a problem in condensed matter physics on a quantum computer that serves to highlight some of Qiskit's capabilities, for example the representation and optimization of circuits at various abstraction levels, its scalability and retargetability to new gates, and the use of quantum-classical computations via dynamic circuits. Lastly, we discuss some of the ecosystem of tools and plugins that extend Qiskit for various tasks, and the future ahead.
- [1397] arXiv:2405.10754 (replaced) [pdf, html, other]
-
Title: Stable Phase Retrieval with Mirror DescentSubjects: Optimization and Control (math.OC); Computer Vision and Pattern Recognition (cs.CV); Information Theory (cs.IT)
In this paper, we aim to reconstruct an n-dimensional real vector from m phaseless measurements corrupted by an additive noise. We extend the noiseless framework developed in [15], based on mirror descent (or Bregman gradient descent), to deal with noisy measurements and prove that the procedure is stable to (small enough) additive noise. In the deterministic case, we show that mirror descent converges to a critical point of the phase retrieval problem, and if the algorithm is well initialized and the noise is small enough, the critical point is near the true vector up to a global sign change. When the measurements are i.i.d Gaussian and the signal-to-noise ratio is large enough, we provide global convergence guarantees that ensure that with high probability, mirror descent converges to a global minimizer near the true vector (up to a global sign change), as soon as the number of measurements m is large enough. The sample complexity bound can be improved if a spectral method is used to provide a good initial guess. We complement our theoretical study with several numerical results showing that mirror descent is both a computationally and statistically efficient scheme to solve the phase retrieval problem.
- [1398] arXiv:2405.11547 (replaced) [pdf, html, other]
-
Title: Certified Robust Accuracy of Neural Networks Are Bounded due to Bayes ErrorsComments: accepted by CAV 2024Subjects: Machine Learning (stat.ML); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Adversarial examples pose a security threat to many critical systems built on neural networks. While certified training improves robustness, it also decreases accuracy noticeably. Despite various proposals for addressing this issue, the significant accuracy drop remains. More importantly, it is not clear whether there is a certain fundamental limit on achieving robustness whilst maintaining accuracy. In this work, we offer a novel perspective based on Bayes errors. By adopting Bayes error to robustness analysis, we investigate the limit of certified robust accuracy, taking into account data distribution uncertainties. We first show that the accuracy inevitably decreases in the pursuit of robustness due to changed Bayes error in the altered data distribution. Subsequently, we establish an upper bound for certified robust accuracy, considering the distribution of individual classes and their boundaries. Our theoretical results are empirically evaluated on real-world datasets and are shown to be consistent with the limited success of existing certified training results, e.g., for CIFAR10, our analysis results in an upper bound (of certified robust accuracy) of 67.49\%, meanwhile existing approaches are only able to increase it from 53.89\% in 2017 to 62.84\% in 2023.
- [1399] arXiv:2405.20904 (replaced) [pdf, html, other]
-
Title: Solving systems of equations on antichains for the computation of the ninth Dedekind NumberSubjects: Combinatorics (math.CO); Discrete Mathematics (cs.DM)
We study systems of equations on antichains, together with a way to count the number of solutions. We start with two simple examples, generalise and show more applications. One of the results was used in the recent computation of D(9), the others have potential to speed up existing techniques in the future.
- [1400] arXiv:2406.03657 (replaced) [pdf, html, other]
-
Title: UrBAN: Urban Beehive Acoustics and PheNotyping DatasetMahsa Abdollahi, Yi Zhu, Heitor R. Guimarães, Nico Coallier, Ségolène Maucourt, Pierre Giovenazzo, Tiago H. FalkSubjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
In this paper, we present a multimodal dataset obtained from a honey bee colony in Montréal, Quebec, Canada, spanning the years of 2021 to 2022. This apiary comprised 10 beehives, with microphones recording more than 2000 hours of high quality raw audio, and also sensors capturing temperature, and humidity. Periodic hive inspections involved monitoring colony honey bee population changes, assessing queen-related conditions, and documenting overall hive health. Additionally, health metrics, such as Varroa mite infestation rates and winter mortality assessments were recorded, offering valuable insights into factors affecting hive health status and resilience. In this study, we first outline the data collection process, sensor data description, and dataset structure. Furthermore, we demonstrate a practical application of this dataset by extracting various features from the raw audio to predict colony population using the number of frames of bees as a proxy.
- [1401] arXiv:2406.04188 (replaced) [pdf, html, other]
-
Title: Digital Twin Aided RIS Communication: Robust Beamforming and Interference ManagementComments: Dataset and code files will be available soon on the DeepMIMIO website: this https URLSubjects: Signal Processing (eess.SP); Information Theory (cs.IT)
Reconfigurable intelligent surfaces (RISs) are envisioned to play a key role in future wireless communication networks. However, channel estimation in RIS-aided wireless networks is challenging due to their passive nature and the large number of reflective elements, leading to high channel estimation overhead. Additionally, conventional methods like beam sweeping, which do not rely on explicit channel state information, often struggle in managing interference in multi-user networks. In this paper, we propose a novel approach that leverages digital twins (DTs) of the physical environments to approximate channels using electromagnetic 3D models and ray tracing, thus relaxing the need for channel estimation and extensive over-the-air computations in RIS-aided wireless networks. To address the digital twins channel approximation errors, we further refine this approach with a DT-specific robust transmission design that reliably meets minimum desired rates. The results show that our method secures these rates over 90% of the time, significantly outperforming beam sweeping, which achieves these rates less than 8% of the time due to its poor management of transmitting power and interference.
- [1402] arXiv:2406.04786 (replaced) [pdf, html, other]
-
Title: Asymptotic Analysis of Near-Field Coupling in Massive MISO and Massive SIMO SystemsComments: Accepted version of the article published in IEEE Communications Letters, 2024. DOI: https://doi.org/10.1109/LCOMM.2024.3416044Subjects: Signal Processing (eess.SP); Systems and Control (eess.SY)
This paper studies the receiver to transmitter antenna coupling in near-field communications with massive arrays. Although most works in the literature consider that it is negligible and approximate it by zero, there is no rigorous analysis on its relevance for practical systems. In this work, we leverage multiport communication theory to obtain conditions for the aforementioned approximation to be valid in MISO and SIMO systems. These conditions are then particularized for arrays with fixed inter-element spacing and arrays with fixed size.
- [1403] arXiv:2406.05128 (replaced) [pdf, html, other]
-
Title: Differentiable Time-Varying Linear Prediction in the Context of End-to-End Analysis-by-SynthesisComments: Accepted at Interspeech 2024Subjects: Audio and Speech Processing (eess.AS); Sound (cs.SD)
Training the linear prediction (LP) operator end-to-end for audio synthesis in modern deep learning frameworks is slow due to its recursive formulation. In addition, frame-wise approximation as an acceleration method cannot generalise well to test time conditions where the LP is computed sample-wise. Efficient differentiable sample-wise LP for end-to-end training is the key to removing this barrier. We generalise the efficient time-invariant LP implementation from the GOLF vocoder to time-varying cases. Combining this with the classic source-filter model, we show that the improved GOLF learns LP coefficients and reconstructs the voice better than its frame-wise counterparts. Moreover, in our listening test, synthesised outputs from GOLF scored higher in quality ratings than the state-of-the-art differentiable WORLD vocoder.
- [1404] arXiv:2406.05714 (replaced) [pdf, html, other]
-
Title: Contextual Continuum Bandits: Static Versus Dynamic RegretSubjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)
We study the contextual continuum bandits problem, where the learner sequentially receives a side information vector and has to choose an action in a convex set, minimizing a function associated to the context. The goal is to minimize all the underlying functions for the received contexts, leading to a dynamic (contextual) notion of regret, which is stronger than the standard static regret. Assuming that the objective functions are Hölder with respect to the contexts, we demonstrate that any algorithm achieving a sub-linear static regret can be extended to achieve a sub-linear dynamic regret. We further study the case of strongly convex and smooth functions when the observations are noisy. Inspired by the interior point method and employing self-concordant barriers, we propose an algorithm achieving a sub-linear dynamic regret. Lastly, we present a minimax lower bound, implying two key facts. First, no algorithm can achieve sub-linear dynamic regret over functions that are not continuous with respect to the context. Second, for strongly convex and smooth functions, the algorithm that we propose achieves, up to a logarithmic factor, the minimax optimal rate of dynamic regret as a function of the number of queries.
- [1405] arXiv:2406.09051 (replaced) [pdf, other]
-
Title: Bayesian Structural Model Updating with Multimodal Variational AutoencoderComments: 44 pages, 21 figuresJournal-ref: Computer Methods in Applied Mechanics and Engineering,Volume 429, 1 September 2024, 117148Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Applications (stat.AP)
A novel framework for Bayesian structural model updating is presented in this study. The proposed method utilizes the surrogate unimodal encoders of a multimodal variational autoencoder (VAE). The method facilitates an approximation of the likelihood when dealing with a small number of observations. It is particularly suitable for high-dimensional correlated simultaneous observations applicable to various dynamic analysis models. The proposed approach was benchmarked using a numerical model of a single-story frame building with acceleration and dynamic strain measurements. Additionally, an example involving a Bayesian update of nonlinear model parameters for a three-degree-of-freedom lumped mass model demonstrates computational efficiency when compared to using the original VAE, while maintaining adequate accuracy for practical applications.
- [1406] arXiv:2406.09765 (replaced) [pdf, other]
-
Title: Application of Natural Language Processing in Financial Risk DetectionSubjects: Risk Management (q-fin.RM); Computation and Language (cs.CL)
This paper explores the application of Natural Language Processing (NLP) in financial risk detection. By constructing an NLP-based financial risk detection model, this study aims to identify and predict potential risks in financial documents and communications. First, the fundamental concepts of NLP and its theoretical foundation, including text mining methods, NLP model design principles, and machine learning algorithms, are introduced. Second, the process of text data preprocessing and feature extraction is described. Finally, the effectiveness and predictive performance of the model are validated through empirical research. The results show that the NLP-based financial risk detection model performs excellently in risk identification and prediction, providing effective risk management tools for financial institutions. This study offers valuable references for the field of financial risk management, utilizing advanced NLP techniques to improve the accuracy and efficiency of financial risk detection.
- [1407] arXiv:2406.10703 (replaced) [pdf, html, other]
-
Title: Calibrating Neural Networks' parameters through Optimal Contraction in a Prediction ProblemComments: 14 pages, 2 figuresSubjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Optimization and Control (math.OC)
This study introduces a novel approach to ensure the existence and uniqueness of optimal parameters in neural networks. The paper details how a recurrent neural networks (RNN) can be transformed into a contraction in a domain where its parameters are linear. It then demonstrates that a prediction problem modeled through an RNN, with a specific regularization term in the loss function, can have its first-order conditions expressed analytically. This system of equations is reduced to two matrix equations involving Sylvester equations, which can be partially solved. We establish that, if certain conditions are met, optimal parameters exist, are unique, and can be found through a straightforward algorithm to any desired precision. Also, as the number of neurons grows the conditions of convergence become easier to fulfill. Feedforward neural networks (FNNs) are also explored by including linear constraints on parameters. According to our model, incorporating loops (with fixed or variable weights) will produce loss functions that train easier, because it assures the existence of a region where an iterative method converges.
- [1408] arXiv:2406.10959 (replaced) [pdf, html, other]
-
Title: On Convergence and Rate of Convergence of Policy Improvement AlgorithmsComments: In this version we added a convergence result for the 1-dimensional model with diffusion controlsSubjects: Optimization and Control (math.OC); Machine Learning (cs.LG)
In this paper we provide a simple proof from scratch for the convergence of Policy Improvement Algorithm (PIA) for a continuous time entropy-regularized stochastic control problem. Such convergence has been established by Huang-Wang-Zhou(2023) by using sophisticated PDE estimates for the iterative PDEs involved in the PIA. Our approach builds on some Feynman-Kac type probabilistic representation formulae for solutions of PDEs and their derivatives. Moreover, in the infinite horizon model with a large discount factor and in the finite horizon model, we obtain the exponential rate of convergence with similar arguments. Finally, in the one dimensional setting, we extend the convergence result to the diffusion control case.
- [1409] arXiv:2406.12008 (replaced) [pdf, html, other]
-
Title: QC-Forest: a Classical-Quantum Algorithm to Provably Speedup Retraining of Random ForestSubjects: Quantum Physics (quant-ph); Machine Learning (cs.LG)
Random Forest (RF) is a popular tree-ensemble method for supervised learning, prized for its ease of use and flexibility. Online RF models require to account for new training data to maintain model accuracy. This is particularly important in applications where data is periodically and sequentially generated over time in data streams, such as auto-driving systems, and credit card payments. In this setting, performing periodic model retraining with the old and new data accumulated is beneficial as it fully captures possible drifts in the data distribution over time. However, this is unpractical with state-of-the-art classical algorithms for RF as they scale linearly with the accumulated number of samples. We propose QC-Forest, a classical-quantum algorithm designed to time-efficiently retrain RF models in the streaming setting for multi-class classification and regression, achieving a runtime poly-logarithmic in the total number of accumulated samples. QC-Forest leverages Des-q, a quantum algorithm for single tree construction and retraining proposed by Kumar et al. by expanding to multi-class classification, as the original proposal was limited to binary classes, and introducing an exact classical method to replace an underlying quantum subroutine incurring a finite error, while maintaining the same poly-logarithmic dependence. Finally, we showcase that QC-Forest achieves competitive accuracy in comparison to state-of-the-art RF methods on widely used benchmark datasets with up to 80,000 samples, while significantly speeding up the model retrain.
- [1410] arXiv:2406.12108 (replaced) [pdf, other]
-
Title: Computing in the Life Sciences: From Early Algorithms to Modern AIComments: 53 pages, 4 figures, 10 tablesSubjects: Other Quantitative Biology (q-bio.OT); Artificial Intelligence (cs.AI)
Computing in the life sciences has undergone a transformative evolution, from early computational models in the 1950s to the applications of artificial intelligence (AI) and machine learning (ML) seen today. This paper highlights key milestones and technological advancements through the historical development of computing in the life sciences. The discussion includes the inception of computational models for biological processes, the advent of bioinformatics tools, and the integration of AI/ML in modern life sciences research. Attention is given to AI-enabled tools used in the life sciences, such as scientific large language models and bio-AI tools, examining their capabilities, limitations, and impact to biological risk. This paper seeks to clarify and establish essential terminology and concepts to ensure informed decision-making and effective communication across disciplines.
- [1411] arXiv:2406.12186 (replaced) [pdf, html, other]
-
Title: Unlocking the Potential of Early Epochs: Uncertainty-aware CT Metal Artifact ReductionSubjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
In computed tomography (CT), the presence of metallic implants in patients often leads to disruptive artifacts in the reconstructed images, hindering accurate diagnosis. Recently, a large amount of supervised deep learning-based approaches have been proposed for metal artifact reduction (MAR). However, these methods neglect the influence of initial training weights. In this paper, we have discovered that the uncertainty image computed from the restoration result of initial training weights can effectively highlight high-frequency regions, including metal artifacts. This observation can be leveraged to assist the MAR network in removing metal artifacts. Therefore, we propose an uncertainty constraint (UC) loss that utilizes the uncertainty image as an adaptive weight to guide the MAR network to focus on the metal artifact region, leading to improved restoration. The proposed UC loss is designed to be a plug-and-play method, compatible with any MAR framework, and easily adoptable. To validate the effectiveness of the UC loss, we conduct extensive experiments on the public available Deeplesion and CLINIC-metal dataset. Experimental results demonstrate that the UC loss further optimizes the network training process and significantly improves the removal of metal artifacts.
- [1412] arXiv:2406.12763 (replaced) [pdf, html, other]
-
Title: Implicit Bias of Mirror Flow on Separable DataComments: Exact same text as first version but the acknowledgments section is updatedSubjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Optimization and Control (math.OC)
We examine the continuous-time counterpart of mirror descent, namely mirror flow, on classification problems which are linearly separable. Such problems are minimised `at infinity' and have many possible solutions; we study which solution is preferred by the algorithm depending on the mirror potential. For exponential tailed losses and under mild assumptions on the potential, we show that the iterates converge in direction towards a $\phi_\infty$-maximum margin classifier. The function $\phi_\infty$ is the $\textit{horizon function}$ of the mirror potential and characterises its shape `at infinity'. When the potential is separable, a simple formula allows to compute this function. We analyse several examples of potentials and provide numerical experiments highlighting our results.