DNAGAST: Generative Adversarial Set Transformers for High-throughput Sequencing

Authors

  • David W. Ludwig II, Department of Computer Science, Middle Tennessee State University, Murfreesboro, TN, USA
  • Joshua L. Phillips, Department of Computer Science, Middle Tennessee State University, Murfreesboro, TN, USA

DOI:

https://doi.org/10.32473/flairs.38.1.139043

Abstract

High-throughput sequencing (HTS) is a modern DNA sequencing technology used to rapidly read thousands of genomic fragments from the microorganisms in a sample. The large amount of data produced by this process makes deep learning, whose performance often scales with dataset size, a suitable fit for processing HTS samples. While deep learning models have used sets of DNA sequences to make informed predictions, to our knowledge, no model in the current literature can generate synthetic HTS samples, a capability that could enable experimenters to predict HTS samples given some environmental parameters. Furthermore, the unordered nature of HTS samples poses a challenge to nearly all deep learning architectures because they have an inherent dependence on input order. To address this gap in the literature, we introduce DNA Generative Adversarial Set Transformer (DNAGAST), the first model capable of generating synthetic HTS samples. We qualitatively and quantitatively demonstrate DNAGAST’s ability to produce realistic synthetic samples and explore various methods to mitigate mode-collapse. Additionally, we propose novel quantitative diversity metrics to measure the effects of mode-collapse for unstructured set-based data.

Published

14-05-2025

How to Cite

Ludwig II, D. W., & Phillips, J. L. (2025). DNAGAST: Generative Adversarial Set Transformers for High-throughput Sequencing. The International FLAIRS Conference Proceedings, 38(1). https://doi.org/10.32473/flairs.38.1.139043