Research Article

Single and Binary Performance Comparison of Data Compression Algorithms for Text Files

Year 2023, Volume: 12 Issue: 3, 783 - 796, 28.09.2023
https://doi.org/10.17798/bitlisfen.1301546

Abstract

Data compression is a technique used to reduce the size of a file. To do so, unnecessary information is removed or parts that repeat the same information are stored only once, which yields lossless compression. The extracted file has all the features of the original file and can be used in the same way. Data compression can be done using different techniques, such as Huffman coding, Lempel-Ziv-Welch coding and the Burrows-Wheeler Transform. Which technique to use depends on the type and size of the data to be compressed. The Huffman, Lempel-Ziv-Welch, Burrows-Wheeler Transform and Deflate algorithms are the most widely used techniques for text compression. Each algorithm takes a different approach and can produce different results in terms of compression ratio and performance. In this study, different data compression techniques were measured on specific data sets, both individually and in pairs stacked on top of each other. When the algorithms were used alone, the most successful result was obtained with the Deflate algorithm, with a compression ratio of 29.08. Among the stacked pairs, the Burrows-Wheeler Transform followed by Deflate gave the best result, with a compression ratio of 57.36. In addition, when compression is performed in pairs, the order in which the two algorithms are applied can make a significant difference in the compression ratio. The performance measurements obtained by applying the algorithms in different orders are compared, and suggestions are presented for obtaining optimum performance.
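
The idea of stacking two algorithms, and of the stacking order mattering, can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names are hypothetical, the Burrows-Wheeler Transform is a naive sorted-rotations version suitable only for small inputs, Deflate is reached through the standard `zlib` module (which adds a small zlib wrapper around the Deflate stream), and the 4-byte index prefix is an arbitrary framing choice. The abstract does not define how the compression ratio is computed, so the sketch assumes a percentage of space saved.

```python
import zlib


def bwt(data: bytes) -> tuple[bytes, int]:
    """Naive Burrows-Wheeler Transform: sort all rotations, keep the last column.

    Returns the last column and the row index of the original string.
    """
    n = len(data)
    if n == 0:
        return b"", 0
    doubled = data + data  # rotation i is doubled[i:i + n]
    order = sorted(range(n), key=lambda i: doubled[i:i + n])
    last = bytes(doubled[i + n - 1] for i in order)
    return last, order.index(0)


def inverse_bwt(last: bytes, primary: int) -> bytes:
    """Invert the transform by walking the last-to-first mapping backwards."""
    n = len(last)
    counts = [0] * 256
    ranks = []
    for b in last:
        ranks.append(counts[b])  # occurrences of b seen before this position
        counts[b] += 1
    starts, total = [0] * 256, 0
    for b in range(256):
        starts[b] = total  # first row of the sorted first column holding b
        total += counts[b]
    out = bytearray(n)
    row = primary
    for k in range(n - 1, -1, -1):
        b = last[row]
        out[k] = b
        row = starts[b] + ranks[row]
    return bytes(out)


def deflate(data: bytes) -> bytes:
    """Deflate via zlib at the highest compression level."""
    return zlib.compress(data, 9)


def bwt_then_deflate(data: bytes) -> bytes:
    """Stacked pair: transform first, then compress (index stored in 4 bytes)."""
    last, primary = bwt(data)
    return deflate(primary.to_bytes(4, "big") + last)


def undo_bwt_then_deflate(blob: bytes) -> bytes:
    """Reverse the stacked pair: decompress, then invert the transform."""
    raw = zlib.decompress(blob)
    return inverse_bwt(raw[4:], int.from_bytes(raw[:4], "big"))


def deflate_then_bwt(data: bytes) -> bytes:
    """Reversed order: the transform only permutes the compressed bytes."""
    last, primary = bwt(deflate(data))
    return primary.to_bytes(4, "big") + last


def space_saving_percent(original: bytes, compressed: bytes) -> float:
    """Assumed metric: percentage by which the size shrinks."""
    return 100.0 * (1 - len(compressed) / len(original))
```

Because the Burrows-Wheeler Transform only reorders bytes, applying it after Deflate cannot shrink the output further in this sketch, whereas applying it before Deflate changes what Deflate sees. Comparing `space_saving_percent(text, deflate(text))`, `space_saving_percent(text, bwt_then_deflate(text))` and `space_saving_percent(text, deflate_then_bwt(text))` on a small text sample shows how the order of application can affect the result.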

There are 32 citations in total.

Details

Primary Language English
Subjects Engineering
Journal Section Research Article
Authors

Serkan Keskin 0000-0001-9404-5039

Onur Sevli 0000-0002-8933-8395

Ersan Okatan 0000-0001-6511-3450

Early Pub Date September 23, 2023
Publication Date September 28, 2023
Submission Date May 24, 2023
Acceptance Date September 4, 2023
Published in Issue Year 2023 Volume: 12 Issue: 3

Cite

IEEE S. Keskin, O. Sevli, and E. Okatan, “Single and Binary Performance Comparison of Data Compression Algorithms for Text Files”, Bitlis Eren Üniversitesi Fen Bilimleri Dergisi, vol. 12, no. 3, pp. 783–796, 2023, doi: 10.17798/bitlisfen.1301546.

Bitlis Eren University
Journal of Science Editor
Bitlis Eren University Graduate Institute
Bes Minare Mah. Ahmet Eren Bulvari, Merkez Kampus, 13000 BITLIS