Text segmentation by language

Robin Cabeza Ruiz

Abstract


There are two approaches to text segmentation by language. The first assumes that language changes occur only at the boundaries between sentences, never within a sentence; the second assumes that they can occur anywhere in the text. This work presents a method for each approach. In the first, the text is split into sentences and the language of each sentence is then identified; the second adapts a hidden Markov model to the task. According to the experimental results, both methods outperform the state of the art.
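The two approaches can be illustrated with a minimal Python sketch. It is not the author's implementation: the toy training sentences, the character-unigram language models, the add-one smoothing, the uniform initial distribution, and the switch probability p_switch are all assumptions chosen only to keep the example self-contained. The hidden Markov model variant treats the language as the hidden state and each character as an observation, and decodes the most likely language sequence with the Viterbi algorithm.

```python
import math
import re
from collections import Counter

# Toy training text per language (hypothetical; a real system needs large corpora).
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and then it sleeps in the sun",
    "es": "el rápido zorro marrón salta sobre el perro perezoso y luego duerme al sol",
}

def train(text):
    """Character-unigram log-probabilities with add-one smoothing."""
    counts = Counter(text.lower())
    total = sum(counts.values())
    vocab = len(counts) + 1  # one extra slot shared by all unseen characters
    unseen = math.log(1 / (total + vocab))
    table = {c: math.log((n + 1) / (total + vocab)) for c, n in counts.items()}
    return table, unseen

MODELS = {lang: train(text) for lang, text in SAMPLES.items()}

def log_emit(lang, ch):
    """Log-probability of character ch under the model for lang."""
    table, unseen = MODELS[lang]
    return table.get(ch.lower(), unseen)

def identify(text):
    """Language whose model gives the text the highest log-likelihood."""
    return max(MODELS, key=lambda lang: sum(log_emit(lang, c) for c in text))

def segment_by_sentence(text):
    """Approach 1: split into sentences, then label each sentence."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    return [(identify(s), s) for s in sentences]

def segment_viterbi(text, p_switch=0.001):
    """Approach 2: one language label per character via Viterbi decoding."""
    if not text:
        return []
    langs = list(MODELS)
    stay = math.log(1 - p_switch)
    switch = math.log(p_switch / (len(langs) - 1))
    # Uniform initial distribution over languages (an assumption).
    score = {l: math.log(1 / len(langs)) + log_emit(l, text[0]) for l in langs}
    back = []  # back[i][l] = best previous language for position i + 1
    for ch in text[1:]:
        new_score, pointers = {}, {}
        for l in langs:
            prev = max(langs, key=lambda p: score[p] + (stay if p == l else switch))
            trans = stay if prev == l else switch
            new_score[l] = score[prev] + trans + log_emit(l, ch)
            pointers[l] = prev
        score = new_score
        back.append(pointers)
    # Follow the back-pointers from the best final state.
    state = max(langs, key=score.get)
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path
```

Calling segment_by_sentence("The dog sleeps. El perro duerme.") returns one (language, sentence) pair per sentence, while segment_viterbi returns one label per character, so a language change can be placed anywhere in the text. A smaller p_switch penalizes language changes more heavily and yields longer segments.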


Keywords


Hidden Markov model; text segmentation by language; natural language processing.






DOI: http://dx.doi.org/10.18046/syt.v14i38.2289
