Text Authorship Identification Based On Ensemble Learning and Genetic Algorithm Combination in Turkish Text

Merve Güllü; Hüseyin Polat

doi:10.2339/politeknik.992493

Research Article

Year 2022, Volume: 25 Issue: 3, 1287 - 1297, 01.10.2022

Cited By: 3

Abstract

Easy access to information through the internet and social media, together with expanding opportunities to search, copy, and spread data, has made it harder to identify the author of a given text. A text carries the characteristic features of the person who wrote it, and these features can be used to identify its author. In this study, we propose a method for author identification in Turkish texts that combines an ensemble learning algorithm (ELA) with a genetic algorithm (GA). The raw data set, comprising 3269 texts by 40 authors, was collected from Turkish news websites and analyzed in a pre-processing step. Syntactic and structural analyses were then performed on the data, yielding 6 different data sets in total. Each data set underwent feature selection using the GA and ELA together. Each resulting data set was then classified with the ELA's bagging method, which contains 5 different classifiers: Naive Bayes, K-Nearest Neighbor, Artificial Neural Networks, Support Vector Machine, and Decision Tree. After these processes were applied to the raw data, the author identification approach reached 89% accuracy. The combination of ELA and GA has strong potential for identifying the author of a text.
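
The pipeline summarized in the abstract (a GA searching over feature subsets, with a bagged ensemble scoring each subset) can be illustrated with a minimal sketch. This is not the authors' implementation: it uses synthetic data in place of the Turkish news corpus, a single nearest-centroid base learner in place of the five classifiers named above, and simple truncation selection, one-point crossover, and bit-flip mutation as assumed GA operators. All function names and parameter values are hypothetical.

```python
import random
from collections import Counter

def fit_centroids(X, y):
    """Return the per-class mean vector (a nearest-centroid model)."""
    sums, counts = {}, Counter(y)
    for row, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(row))
        for j, value in enumerate(row):
            acc[j] += value
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}

def predict(centroids, row, mask):
    """Predict the class whose centroid is closest on the selected features."""
    def dist(c):
        return sum((row[j] - c[j]) ** 2 for j, keep in enumerate(mask) if keep)
    return min(centroids, key=lambda label: dist(centroids[label]))

def bagged_accuracy(X_tr, y_tr, X_val, y_val, mask, rng, n_models=5):
    """Fitness: validation accuracy of a bagged ensemble restricted to `mask`."""
    if not any(mask):
        return 0.0  # an empty feature subset cannot classify anything
    n = len(X_tr)
    models = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]  # bootstrap sample
        models.append(fit_centroids([X_tr[i] for i in idx], [y_tr[i] for i in idx]))
    hits = 0
    for row, target in zip(X_val, y_val):
        votes = Counter(predict(m, row, mask) for m in models)  # majority vote
        hits += votes.most_common(1)[0][0] == target
    return hits / len(X_val)

def ga_select(X_tr, y_tr, X_val, y_val, rng, pop_size=12, generations=10, p_mut=0.1):
    """Search for a boolean feature mask that maximizes bagged accuracy."""
    n_features = len(X_tr[0])
    pop = [[rng.random() < 0.5 for _ in range(n_features)] for _ in range(pop_size)]
    best_mask, best_fit = None, -1.0
    for _ in range(generations):
        scored = sorted(
            ((bagged_accuracy(X_tr, y_tr, X_val, y_val, m, rng), m) for m in pop),
            key=lambda t: t[0], reverse=True)
        if scored[0][0] > best_fit:
            best_fit, best_mask = scored[0][0], list(scored[0][1])
        parents = [m for _, m in scored[: pop_size // 2]]  # truncation selection
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n_features)  # one-point crossover
            child = [(not g) if rng.random() < p_mut else g for g in a[:cut] + b[cut:]]
            children.append(child)  # bit-flip mutation applied above
        pop = parents + children
    return best_mask, best_fit

def make_data(rng, n_per_class=40, n_features=8, n_informative=3):
    """Synthetic stand-in: 3 'authors'; only the first features carry signal."""
    X, y = [], []
    for label in range(3):
        for _ in range(n_per_class):
            X.append([rng.gauss(2.0 * label if j < n_informative else 0.0, 1.0)
                      for j in range(n_features)])
            y.append(label)
    return X, y

def run_demo(seed=0):
    rng = random.Random(seed)
    X, y = make_data(rng)
    order = list(range(len(X)))
    rng.shuffle(order)
    split = int(0.7 * len(order))
    tr, val = order[:split], order[split:]
    return ga_select([X[i] for i in tr], [y[i] for i in tr],
                     [X[i] for i in val], [y[i] for i in val], rng)

mask, fitness = run_demo(seed=0)
print("selected features:", [j for j, keep in enumerate(mask) if keep])
print("validation accuracy: %.2f" % fitness)
```

In this sketch each chromosome is a boolean mask over features and its fitness is the accuracy of a majority vote over models trained on bootstrap samples. A fuller version would score candidates with cross-validation rather than a single validation split, and would swap in the heterogeneous classifiers the abstract lists.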

Keywords

author identification, ensemble learning, genetic algorithm, feature selection

Turkish title: Türkçe Metinde Topluluk Öğrenme ve Genetik Algoritma Kombinasyonu Tabanlı Yazar Tahmini

Details

Primary Language	English
Subjects	Engineering
Section	Research Article
Authors	Merve Güllü 0000-0001-7442-1332 Hüseyin Polat 0000-0003-4128-2625
Publication Date	October 1, 2022
Submission Date	September 7, 2021
Published in Issue	Year 2022 Volume: 25 Issue: 3

Cite

APA	Güllü, M., & Polat, H. (2022). Text Authorship Identification Based On Ensemble Learning and Genetic Algorithm Combination in Turkish Text. Politeknik Dergisi, 25(3), 1287-1297. https://doi.org/10.2339/politeknik.992493
AMA	Güllü M, Polat H. Text Authorship Identification Based On Ensemble Learning and Genetic Algorithm Combination in Turkish Text. Politeknik Dergisi. October 2022;25(3):1287-1297. doi:10.2339/politeknik.992493
Chicago	Güllü, Merve, and Hüseyin Polat. “Text Authorship Identification Based On Ensemble Learning and Genetic Algorithm Combination in Turkish Text”. Politeknik Dergisi 25, no. 3 (October 2022): 1287-97. https://doi.org/10.2339/politeknik.992493.
EndNote	Güllü M, Polat H (October 1, 2022) Text Authorship Identification Based On Ensemble Learning and Genetic Algorithm Combination in Turkish Text. Politeknik Dergisi 25 3 1287–1297.
IEEE	M. Güllü and H. Polat, “Text Authorship Identification Based On Ensemble Learning and Genetic Algorithm Combination in Turkish Text”, Politeknik Dergisi, vol. 25, no. 3, pp. 1287–1297, 2022, doi: 10.2339/politeknik.992493.
ISNAD	Güllü, Merve - Polat, Hüseyin. “Text Authorship Identification Based On Ensemble Learning and Genetic Algorithm Combination in Turkish Text”. Politeknik Dergisi 25/3 (October 2022), 1287-1297. https://doi.org/10.2339/politeknik.992493.
JAMA	Güllü M, Polat H. Text Authorship Identification Based On Ensemble Learning and Genetic Algorithm Combination in Turkish Text. Politeknik Dergisi. 2022;25:1287–1297.
MLA	Güllü, Merve, and Hüseyin Polat. “Text Authorship Identification Based On Ensemble Learning and Genetic Algorithm Combination in Turkish Text”. Politeknik Dergisi, vol. 25, no. 3, 2022, pp. 1287-97, doi:10.2339/politeknik.992493.
Vancouver	Güllü M, Polat H. Text Authorship Identification Based On Ensemble Learning and Genetic Algorithm Combination in Turkish Text. Politeknik Dergisi. 2022;25(3):1287-97.