A light method for data generation: a combination of Markov Chains and Word Embeddings.
Un método ligero de generación de datos: combinación entre Cadenas de Markov y Word Embeddings.
Martínez García, Eva
Nogales Moyano, Alberto
Morales Escudero, Javier
García Tejedor, Álvaro José
Generation
Hybrid
Markov Chains
Embeddings
Similarity
Most current state-of-the-art Natural Language Processing (NLP) techniques are highly data-dependent: they require a significant amount of training data, and in some scenarios data is scarce. We present a hybrid method that generates new sentences to augment the training data. Our approach combines Markov Chains and word embeddings to produce high-quality data similar to an initial dataset. In contrast to other neural-based generative methods, it does not need a large amount of training data. Results show that our approach can generate useful data for NLP tools. In particular, we validate our approach by building Transformer-based Language Models using data from three different domains in the context of enriching general-purpose chatbots.
post-print
1.74 MB
2021-06-16T08:19:29Z
2021-06-16T08:19:29Z
2020
article
1135-5948
http://hdl.handle.net/10641/2327
10.26342/2020-64-10
eng
http://journal.sepln.org/sepln/ojs/ojs/index.php/pln/article/view/6199
Attribution-NonCommercial-NoDerivs 3.0 Spain
http://creativecommons.org/licenses/by-nc-nd/3.0/es/
openAccess
Procesamiento del Lenguaje Natural