2021-06-16T08:19:29Z
urn:hdl:10641/2327
A light method for data generation: a combination of Markov Chains and Word Embeddings.
Martínez García, Eva
Nogales Moyano, Alberto
Morales Escudero, Javier
García Tejedor, Álvaro José
Generation
Hybrid
Markov Chains
Embeddings
Similarity
Most current state-of-the-art Natural Language Processing (NLP) techniques are highly data-dependent. They require a significant amount of training data, and in some scenarios data is scarce. We present a hybrid method that generates new sentences to augment the training data. Our approach combines Markov Chains and word embeddings to produce high-quality data similar to an initial dataset. In contrast to neural-based generative methods, it does not need a large amount of training data. Results show that our approach can generate useful data for NLP tools. In particular, we validate our approach by building Transformer-based Language Models using data from three different domains in the context of enriching general-purpose chatbots.
2021-06-16T08:19:29Z
2021-06-16T08:19:29Z
2020
article
1135-5948
http://hdl.handle.net/10641/2327
10.26342/2020-64-10
eng
http://journal.sepln.org/sepln/ojs/ojs/index.php/pln/article/view/6199
http://creativecommons.org/licenses/by-nc-nd/3.0/es/
openAccess
Attribution-NonCommercial-NoDerivs 3.0 Spain
Procesamiento del Lenguaje Natural