Domain-agnostic Document Representation Learning Using Latent Topics and Metadata

Natraj Raman; Armineh Nourbakhsh; Sameena Shah; Manuela Veloso

doi:10.32473/flairs.v34i1.128388

Authors

Natraj Raman, J.P. Morgan AI Research
Armineh Nourbakhsh, J.P. Morgan AI Research
Sameena Shah, J.P. Morgan AI Research
Manuela Veloso, J.P. Morgan AI Research

DOI:

https://doi.org/10.32473/flairs.v34i1.128388

Keywords:

representation learning, self-supervision, text metadata, few-shot learning

Abstract

Fine-tuning a pre-trained neural language model with a task-specific output layer is the de facto approach of late when dealing with document classification. This technique is inadequate when labeled examples are unavailable at training time and when the metadata artifacts in a document must be exploited. We address these challenges by generating document representations that capture both text and metadata in a task-agnostic manner. Instead of traditional auto-regressive or auto-encoding based training, our novel self-supervised approach learns a soft partition of the input space when generating text embeddings by employing a pre-learned topic model distribution as surrogate labels. Our solution also incorporates metadata explicitly rather than just augmenting the text with them. The generated document embeddings exhibit compositional characteristics and are directly used by downstream classification tasks to create decision boundaries from a small number of labels, thereby eschewing complicated recognition methods. We demonstrate through extensive evaluation that our proposed cross-model fusion solution outperforms several competitive baselines on multiple domains.
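The idea of using topic proportions as surrogate labels can be illustrated with a minimal sketch. This is not the authors' model: it assumes a toy linear encoder trained with soft-label cross-entropy, synthetic stand-ins for the pre-learned topic distribution and the metadata features, and simple concatenation as the way metadata is incorporated. The names `train_encoder` and `embed` and all hyperparameters are hypothetical.

```python
import numpy as np

def softmax(z):
    """Row-wise softmax with the usual max-shift for numerical stability."""
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def train_encoder(X, topic_dist, dim=8, lr=0.5, epochs=200, seed=0):
    """Fit a linear encoder whose embeddings predict the topic proportions.

    X          : (n_docs, vocab) term-frequency matrix
    topic_dist : (n_docs, n_topics) soft surrogate labels, rows sum to 1
    Returns the encoder weights and the per-epoch loss.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    k = topic_dist.shape[1]
    W_enc = rng.normal(scale=0.1, size=(d, dim))   # text -> embedding
    W_out = rng.normal(scale=0.1, size=(dim, k))   # embedding -> topic logits
    losses = []
    for _ in range(epochs):
        H = X @ W_enc
        P = softmax(H @ W_out)
        # Cross-entropy against soft labels (no hand-labeled classes involved).
        losses.append(-np.mean(np.sum(topic_dist * np.log(P + 1e-12), axis=1)))
        G = (P - topic_dist) / n                   # gradient w.r.t. the logits
        grad_out = H.T @ G
        grad_enc = X.T @ (G @ W_out.T)
        W_out -= lr * grad_out
        W_enc -= lr * grad_enc
    return W_enc, losses

def embed(X, meta, W_enc):
    """Document embedding = text embedding concatenated with metadata features."""
    return np.concatenate([X @ W_enc, meta], axis=1)

# Synthetic stand-ins (assumptions): in practice topic_dist would come from a
# topic model fitted beforehand, and meta from the document's metadata fields.
rng = np.random.default_rng(0)
topic_dist = rng.dirichlet([0.5] * 3, size=40)      # 40 docs, 3 topics
topic_word = rng.dirichlet([0.5] * 30, size=3)      # 3 topics, 30-word vocab
X = topic_dist @ topic_word                         # expected term frequencies
meta = rng.integers(0, 2, size=(40, 2)).astype(float)

W_enc, losses = train_encoder(X, topic_dist)
doc_emb = embed(X, meta, W_enc)
print(doc_emb.shape, losses[0], losses[-1])
```

The resulting `doc_emb` rows could then be passed to any lightweight classifier fitted on a handful of labeled documents, which is the few-shot setting the abstract describes.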
