Conditional random fields in text segmentation by language

  • Robin Cabeza Ruiz, Universidad de Holguín
Keywords: Text segmentation by language, conditional random fields.


This work presents the use of conditional random fields to solve the task of text segmentation by language, treating it as a sequence tagging task. Language changes are assumed to be possible at any point in the text; the observations are the words of the text, and the states are the different languages. The results lead to the conclusion that conditional random fields are a powerful tool for segmenting multilingual text.
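The sequence tagging view described above can be illustrated with a minimal sketch of linear-chain decoding, in which each word is an observation and each language is a state. Everything specific in it is an assumption made for illustration and is not taken from the paper: the two language codes, the feature templates (`word=` and `suffix=`), and the hand-set weights in `EMISSION` and `TRANSITION`, which stand in for parameters that a trained conditional random field would learn from labeled data.

```python
from itertools import groupby
from typing import Dict, List, Tuple

LANGS = ["en", "es"]  # hypothetical label set: one state per language

# Hypothetical hand-set feature weights; a trained CRF would estimate these.
EMISSION: Dict[Tuple[str, str], float] = {
    ("en", "word=the"): 2.0,
    ("en", "word=is"): 2.0,
    ("en", "suffix=ing"): 1.5,
    ("es", "word=el"): 2.0,
    ("es", "word=está"): 2.0,
    ("es", "suffix=ndo"): 1.5,
}
# Staying in the same language is rewarded; switching is mildly penalized.
TRANSITION: Dict[Tuple[str, str], float] = {
    (a, b): (0.5 if a == b else -0.5) for a in LANGS for b in LANGS
}


def features(word: str) -> List[str]:
    """Map one observed word to its active feature names."""
    w = word.lower()
    return [f"word={w}", f"suffix={w[-3:]}"]


def emission_score(lang: str, word: str) -> float:
    """Sum the weights of the features that fire for (lang, word)."""
    return sum(EMISSION.get((lang, f), 0.0) for f in features(word))


def viterbi(words: List[str]) -> List[str]:
    """Return the highest-scoring language label for every word."""
    if not words:
        return []
    # best[i][lang]: best score of any labeling of words[:i+1] ending in lang
    best = [{lang: emission_score(lang, words[0]) for lang in LANGS}]
    back: List[Dict[str, str]] = []
    for word in words[1:]:
        scores, pointers = {}, {}
        for cur in LANGS:
            prev = max(LANGS, key=lambda p: best[-1][p] + TRANSITION[(p, cur)])
            scores[cur] = (
                best[-1][prev] + TRANSITION[(prev, cur)] + emission_score(cur, word)
            )
            pointers[cur] = prev
        best.append(scores)
        back.append(pointers)
    # Follow the back-pointers from the best final state.
    path = [max(LANGS, key=lambda lang: best[-1][lang])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


def segments(words: List[str], labels: List[str]) -> List[Tuple[str, str]]:
    """Merge consecutive words that share a label into one segment."""
    return [
        (lang, " ".join(w for _, w in group))
        for lang, group in groupby(zip(labels, words), key=lambda pair: pair[0])
    ]


if __name__ == "__main__":
    text = "the cat is sleeping el gato está durmiendo".split()
    print(segments(text, viterbi(text)))
```

Words with no informative feature of their own, such as "cat" and "gato" here, are labeled through the transition scores, which is why a language switch can be placed at any word boundary yet is only chosen when the surrounding evidence supports it.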



Author Biography

Robin Cabeza Ruiz, Universidad de Holguín

Master's degree in Computer-Aided Design from the Universidad de Holguín (Cuba, 2015) and bachelor's degree in Computer Science from the Universidad de Oriente (Cuba, 2017). He is currently a professor of Informatics II and a member of the CAD/CAM Studies Center at the Faculty of Engineering of the Universidad de Holguín. His main research interests are biomechanics and computer-based text segmentation.


Discussion papers