The Project


Modern ML applications are often data hungry. Sometimes this is caused by the extensive process of data collection, and sometimes by the striking diversity of data formats, which makes it hard to construct a large, homogeneous dataset. Given an input dataset from the user, the first step of the process aims to provide automatic data augmentation (adding new data examples automatically) and automatic data ingestion (automatically normalizing data into a single machine-readable format).

The current version focuses on the NLP domain. For data augmentation, we designed a novel framework for automatic style transfer of natural language text (e.g., passive to active voice). For data ingestion, we designed a robust system, based on a novel weak supervision technique, that automatically parses documents in different formats (e.g., PDFs, Word files, and scanned documents) into machine-readable JSON objects.

Input: Input dataset with labels.

Output: Augmented, machine-readable dataset with labels.
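
The input/output contract above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the function names (normalize, build_dataset), the record fields (text, label, augmented), and the toy augmenter are hypothetical and are not the project's actual API. The whitespace normalization only stands in for real ingestion, and lowercasing only stands in for real style transfer.

```python
import json
from typing import Callable, Iterable, List, Tuple

Example = Tuple[str, str]  # (raw text, label); hypothetical input encoding


def normalize(text: str) -> str:
    """Stand-in for ingestion: collapse whitespace so all records share one format."""
    return " ".join(text.split())


def build_dataset(
    examples: Iterable[Example],
    augmenters: Iterable[Callable[[str], str]],
) -> List[str]:
    """Return one JSON string per record: originals plus label-preserving variants."""
    augmenters = list(augmenters)
    records = []
    for text, label in examples:
        clean = normalize(text)
        records.append({"text": clean, "label": label, "augmented": False})
        for augment in augmenters:
            variant = augment(clean)
            # Keep a variant only if it actually differs from the original.
            if variant != clean:
                records.append({"text": variant, "label": label, "augmented": True})
    return [json.dumps(record, sort_keys=True) for record in records]


# Toy augmenter (assumption): lowercasing, used only to exercise the pipeline.
dataset = build_dataset([("  Hello   World ", "greeting")], [str.lower])
for line in dataset:
    print(line)
```

Each augmented record inherits the label of the example it was derived from, which is what lets the output remain a labeled dataset.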



Control, Generate, Augment: A Scalable Framework for Multi-Attribute Text Generation
G Russo, N Hollenstein, C Musat, C Zhang
[EMNLP] Findings of the Association for Computational Linguistics

We introduce CGA, a conditional VAE architecture, to control, generate, and augment text. CGA is able to generate natural English sentences controlling multiple semantic and syntactic attributes by combining adversarial learning with a context-aware loss and a cyclical word dropout routine. We demonstrate the value of the individual model components in an ablation study. The scalability of our approach is ensured through a single discriminator, independently of the number of attributes. We show high quality, diversity and attribute control in the generated sentences through a series of automatic and human assessments. As the main application of our work, we test the potential of this new NLG model in a data augmentation scenario. In a downstream NLP task, the sentences generated by our CGA model show significant improvements over a strong baseline, and a classification performance often comparable to adding the same amount of additional real data.
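
To give a rough idea of what a cyclical word dropout routine could look like, here is a minimal sketch. The abstract does not specify the schedule or its parameters, so the cosine cycle, the period, the maximum rate, and the "<unk>" placeholder token are all assumptions made for illustration, not the schedule used in CGA.

```python
import math
import random
from typing import List


def cyclical_rate(step: int, period: int = 1000, max_rate: float = 0.5) -> float:
    """Dropout rate that rises from 0 to max_rate and back once per period (assumed schedule)."""
    phase = (step % period) / period
    return max_rate * 0.5 * (1.0 - math.cos(2.0 * math.pi * phase))


def word_dropout(tokens: List[str], rate: float, rng: random.Random, unk: str = "<unk>") -> List[str]:
    """Replace each token with the placeholder independently with probability `rate`."""
    return [unk if rng.random() < rate else token for token in tokens]


rng = random.Random(0)
sentence = "the movie was surprisingly good".split()
for step in (0, 250, 500):
    rate = cyclical_rate(step)
    print(step, round(rate, 3), word_dropout(sentence, rate, rng))
```

Word dropout is commonly applied to the decoder input of a VAE so that the decoder cannot rely only on the preceding words; varying the rate over training is one way to balance that pressure against reconstruction quality.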

DocParser: Hierarchical structure parsing of document renderings
J Rausch, O Martinez, F Bissig, C Zhang, S Feuerriegel
[AAAI] Proceedings of the AAAI Conference on Artificial Intelligence

Translating renderings (e.g., PDFs, scans) into hierarchical document structures is extensively demanded in the daily routines of many real-world applications. However, a holistic, principled approach to inferring the complete hierarchical structure of documents is missing. As a remedy, we developed “DocParser”: an end-to-end system for parsing the complete document structure, including all text elements, nested figures, tables, and table cell structures. Our second contribution is to provide a dataset for evaluating hierarchical document structure parsing. Our third contribution is to propose a scalable learning framework for settings where domain-specific data are scarce, which we address by a novel approach to weak supervision that significantly improves the document structure parsing performance. Our experiments confirm the effectiveness of our proposed weak supervision: compared to the baseline without weak supervision, it improves the mean average precision for detecting document entities by 39.1% and improves the F1 score of classifying hierarchical relations by 35.8%.
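
For readers unfamiliar with the F1 score reported for hierarchical relations, the sketch below computes it over sets of relation triples. The triple encoding (parent id, child id, relation type) and the example identifiers are assumptions made for illustration; the exact evaluation protocol used for DocParser may differ.

```python
from typing import Set, Tuple

Relation = Tuple[str, str, str]  # (parent id, child id, relation type); hypothetical encoding


def relation_f1(predicted: Set[Relation], gold: Set[Relation]) -> float:
    """F1 = harmonic mean of precision and recall over exactly matching triples."""
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


gold = {
    ("doc", "sec1", "parent_of"),
    ("sec1", "fig1", "parent_of"),
    ("sec1", "tab1", "parent_of"),
}
predicted = {
    ("doc", "sec1", "parent_of"),
    ("sec1", "fig1", "parent_of"),
    ("doc", "tab1", "parent_of"),  # table attached to the wrong parent
}
score = relation_f1(predicted, gold)
print(f"F1 = {score:.3f}")
```

In this toy case two of the three predicted relations match the gold structure, so precision and recall are both 2/3 and the F1 score is 2/3 as well.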