The current version of ease.ml/DataMagic focuses on the NLP domain. For data augmentation, we designed a novel framework for automatic style transfer in natural language text (e.g., passive to active voice). For data ingestion, we designed a robust system, based on a novel weak supervision technique, that automatically parses documents in different formats (e.g., PDFs, Word files, and scanned documents) into machine-readable JSON objects.
Input: Dataset with labels.
Output: Augmented, machine-readable dataset with labels.
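To make the input/output contract concrete, here is a minimal sketch of a labeled dataset going in and an augmented, JSON-serializable dataset coming out. Everything in it is an assumption for illustration: `LabeledExample`, `augment`, `to_json`, and the lookup-table `toy_style_transfer` are hypothetical names, and the lookup table merely stands in for a learned style-transfer model. It is not the ease.ml/DataMagic API.

```python
import json
from dataclasses import dataclass, asdict
from typing import Callable, List


@dataclass
class LabeledExample:
    """One labeled text sample (hypothetical schema)."""
    text: str
    label: str


def augment(dataset: List[LabeledExample],
            transform: Callable[[str], str]) -> List[LabeledExample]:
    """Return the originals plus one transformed copy per example.

    The label is carried over unchanged; examples that the transform
    leaves untouched are not duplicated.
    """
    augmented = list(dataset)
    for example in dataset:
        new_text = transform(example.text)
        if new_text != example.text:
            augmented.append(LabeledExample(new_text, example.label))
    return augmented


def to_json(dataset: List[LabeledExample]) -> str:
    """Serialize the dataset as a JSON array of {text, label} objects."""
    return json.dumps([asdict(example) for example in dataset], indent=2)


# Stand-in for a learned passive-to-active style-transfer model (assumption).
_TOY_REWRITES = {"The report was written by Ana.": "Ana wrote the report."}


def toy_style_transfer(text: str) -> str:
    """Rewrite known sentences; return all other text unchanged."""
    return _TOY_REWRITES.get(text, text)


dataset = [
    LabeledExample("The report was written by Ana.", "neutral"),
    LabeledExample("Great service!", "positive"),
]
augmented = augment(dataset, toy_style_transfer)
print(to_json(augmented))
```

Keeping the originals at the front of the list and appending the generated copies makes it easy to trace which samples are synthetic.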
[EMNLP] Findings of the Association for Computational Linguistics
We introduce CGA, a conditional VAE architecture, to control, generate, and augment text. CGA is able to generate natural English sentences controlling multiple semantic and syntactic attributes by combining adversarial learning with a context-aware loss and a cyclical word dropout routine. We demonstrate the value of the individual model components in an ablation study. The scalability of our approach is ensured through a single discriminator, independently of the number of attributes. We show high quality, diversity, and attribute control in the generated sentences through a series of automatic and human assessments. As the main application of our work, we test the potential of this new NLG model in a data augmentation scenario. In a downstream NLP task, the sentences generated by our CGA model show significant improvements over a strong baseline, and a classification performance often comparable to adding the same amount of additional real data.
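The abstract mentions a cyclical word dropout routine without giving its form. The sketch below shows one plausible reading, under stated assumptions: word dropout replaces decoder input tokens with an unknown-word placeholder, and the dropout rate follows a cosine cycle over training steps. The cosine shape, the parameter names (`period`, `min_rate`, `max_rate`), and the `<unk>` token are illustrative choices, not the schedule used in the CGA paper.

```python
import math
import random
from typing import List


def cyclical_rate(step: int, period: int,
                  min_rate: float = 0.0, max_rate: float = 0.5) -> float:
    """Cosine cycle (assumed shape): max_rate at the start of each period,
    min_rate halfway through, back to max_rate at the next period."""
    phase = (step % period) / period
    return min_rate + 0.5 * (max_rate - min_rate) * (1 + math.cos(2 * math.pi * phase))


def word_dropout(tokens: List[str], rate: float,
                 rng: random.Random, unk: str = "<unk>") -> List[str]:
    """Replace each token with `unk` independently with probability `rate`."""
    return [unk if rng.random() < rate else token for token in tokens]


rng = random.Random(0)  # fixed seed so the demo is repeatable
sentence = "the movie was surprisingly good".split()
for step in (0, 250, 500, 750, 1000):
    rate = cyclical_rate(step, period=1000)
    print(step, round(rate, 3), word_dropout(sentence, rate, rng))
```

A high dropout rate forces the decoder to rely on the latent code rather than on the previous words, and cycling the rate alternates between that pressure and ordinary teacher-forced training.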
[AAAI] Proceedings of the AAAI Conference on Artificial Intelligence
Translating renderings (e.g., PDFs, scans) into hierarchical document structures is extensively demanded in the daily routines of many real-world applications. However, a holistic, principled approach to inferring the complete hierarchical structure of documents is missing. As a remedy, we developed “DocParser”: an end-to-end system for parsing the complete document structure, including all text elements, nested figures, tables, and table cell structures. Our second contribution is to provide a dataset for evaluating hierarchical document structure parsing. Our third contribution is to propose a scalable learning framework for settings where domain-specific data are scarce, which we address by a novel approach to weak supervision that significantly improves the document structure parsing performance. Our experiments confirm the effectiveness of our proposed weak supervision: compared to the baseline without weak supervision, it improves the mean average precision for detecting document entities by 39.1% and improves the F1 score of classifying hierarchical relations by 35.8%.
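For readers unfamiliar with the F1 metric reported above, the following sketch computes precision, recall, and F1 over sets of predicted and gold relations. The triple representation `(parent, child, relation)` and the relation label `parent_of` are assumptions chosen for illustration; the exact matching criteria used in the DocParser evaluation may differ.

```python
from typing import Set, Tuple

Relation = Tuple[str, str, str]  # (parent, child, relation label): assumed format


def precision_recall_f1(predicted: Set[Relation],
                        gold: Set[Relation]) -> Tuple[float, float, float]:
    """Set-based precision, recall, and F1; a prediction counts as correct
    only if the full triple appears in the gold set."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy document hierarchy (hypothetical entity names).
gold = {
    ("document", "section", "parent_of"),
    ("section", "table", "parent_of"),
    ("table", "cell", "parent_of"),
    ("section", "figure", "parent_of"),
}
predicted = {
    ("document", "section", "parent_of"),
    ("section", "table", "parent_of"),
    ("document", "figure", "parent_of"),  # wrong parent, so not a match
}

precision, recall, f1 = precision_recall_f1(predicted, gold)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

In this toy case two of three predictions are correct and two of four gold relations are recovered, so precision is 2/3, recall is 1/2, and F1 is 4/7.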