Conditional random fields in text segmentation by language

Robin Cabeza Ruiz

Abstract


This work applies conditional random fields to the task of text segmentation by language, treating it as a sequence tagging problem. Language changes are assumed to be possible at any point in the text; the observations are the words of the text, and the states are the different languages. The results indicate that conditional random fields are a powerful tool for segmenting multilingual text.
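
The sequence-tagging framing described above can be illustrated with a minimal sketch: each word is an observation with a small feature map, each word receives one language label (the state), and runs of equal labels form the segments. The function names, the feature set, and the sample labels below are hypothetical and chosen for illustration only; they are not taken from the paper, and the training and decoding of the conditional random field itself are omitted.

```python
from itertools import groupby


def word_features(words, i):
    """Hypothetical feature map for the word at position i (the observation)."""
    w = words[i].lower()
    feats = {
        "bias": 1.0,
        "word": w,
        "prefix3": w[:3],
        "suffix3": w[-3:],
        "has_non_ascii": any(ord(c) > 127 for c in w),
    }
    # Neighbouring words give context for detecting a change of language.
    feats["prev_word"] = words[i - 1].lower() if i > 0 else "<BOS>"
    feats["next_word"] = words[i + 1].lower() if i < len(words) - 1 else "<EOS>"
    return feats


def tags_to_segments(words, tags):
    """Collapse one language tag per word into contiguous (language, text) segments."""
    segments = []
    pos = 0
    for lang, run in groupby(tags):
        n = len(list(run))
        segments.append((lang, " ".join(words[pos:pos + n])))
        pos += n
    return segments


words = "the cat sleeps el gato duerme".split()
# Illustrative labels written by hand; a trained model would predict these.
tags = ["en", "en", "en", "es", "es", "es"]

features = [word_features(words, i) for i in range(len(words))]
segments = tags_to_segments(words, tags)
```

In a full pipeline, the per-word feature dictionaries and their label sequences would be passed to a conditional random field trainer, and the predicted labels would then be collapsed into segments as shown.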


Keywords


Text segmentation by language; conditional random fields.

DOI: http://dx.doi.org/10.18046/syt.v15i43.2712

Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.