Evaluating Text Preprocessing Methods for Discovering Quality Topics to
Improve the Information Retrieval Mechanism

Acta Scientific COMPUTER SCIENCES

Research Article Volume 5 Issue 9

Lakshmi Sonkusale¹, Krishna Kumar Chaturvedi²*, Anu Sharma², Shashi Bhushan Lal², Mohammad Samir Farooqi³, Achal Lama⁴, Dwijesh Chandra Mishra⁴, Pratibha Joshi⁵, Murari Kumar¹

¹Ph.D. Scholar, The Graduate School, ICAR-Indian Agricultural Research Institute, New Delhi, India
²Principal Scientist, ICAR-Indian Agricultural Statistics Research Institute, New Delhi, India
³Senior Scientist, ICAR-Indian Agricultural Statistics Research Institute, New Delhi, India
⁴Scientist, ICAR-Indian Agricultural Statistics Research Institute, New Delhi, India
⁵Scientist, ICAR-Indian Agricultural Research Institute, New Delhi, India

*Corresponding Author: Krishna Kumar Chaturvedi, Principal Scientist, ICAR-Indian Agricultural Statistics Research Institute, New Delhi, India.

Received: July 24, 2023; Published: August 14, 2023

Abstract

Topic discovery extracts the underlying semantic structure from large collections of unstructured text. It is a convenient way to organize unclassified text into topic clusters that can be used to classify documents. A topic is a set of words that frequently occur together and that assign the text to a specific category. Topic discovery can group words with similar meanings and distinguish between uses of words with multiple meanings. It is an important and challenging task in the information retrieval process. This paper discusses different text preprocessing methods and their use with Latent Dirichlet Allocation (LDA) in determining the number of topics, which will help in developing new computational methods to identify topics in text datasets. LDA is a statistical modelling approach for analysing unclassified text into useful topics. In this study, the effect of text preprocessing methods on a collection of research articles is explored: hyperparameters are optimized by grid search, and the resulting topics are evaluated using coherence score and topic score. The study suggests that preprocessing affects both the number of topics and their quality. The findings will help enhance information retrieval mechanisms based on the identified topics and will also be useful in recommending related research articles to researchers.

Keywords: Topic Model; Hyperparameters; Topic Discovery; Latent Dirichlet Allocation (LDA); Grid Search
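
The link between preprocessing and topic quality described in the abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: the toy corpus, the stop-word list, the crude plural stripping (a stand-in for stemming or lemmatization), and the fixed topic word list `TOPIC` are all assumptions made for illustration, and no LDA model is fitted. The sketch scores one hypothetical topic with the UMass coherence measure of Mimno et al. (2011) under two preprocessing settings.

```python
import math
import re
from itertools import combinations

# Hypothetical toy corpus and stop-word list, for illustration only.
DOCS = [
    "Topic models extract topics from the text of documents.",
    "The documents are grouped into topics by the model.",
    "Grid search tunes the hyperparameters of a topic model.",
    "Coherence scores evaluate the quality of topics.",
]
STOPWORDS = {"the", "of", "a", "are", "by", "into", "from"}


def preprocess(text, remove_stopwords=True, strip_plural=True):
    """Lowercase, tokenize, and optionally filter and normalize tokens."""
    tokens = re.findall(r"[a-z]+", text.lower())
    if remove_stopwords:
        tokens = [t for t in tokens if t not in STOPWORDS]
    if strip_plural:
        # Crude stand-in for stemming/lemmatization: drop a trailing "s".
        tokens = [t[:-1] if t.endswith("s") and len(t) > 3 else t for t in tokens]
    return tokens


def umass_coherence(top_words, tokenized_docs):
    """UMass coherence: sum of log((D(w_m, w_l) + 1) / D(w_l)) over pairs l < m."""
    doc_sets = [set(d) for d in tokenized_docs]

    def doc_freq(*words):
        # Number of documents containing every given word.
        return sum(all(w in s for w in words) for s in doc_sets)

    score = 0.0
    for l, m in combinations(range(len(top_words)), 2):  # guarantees l < m
        denom = doc_freq(top_words[l])
        if denom == 0:
            continue  # higher-ranked word never occurs; skip the pair
        score += math.log((doc_freq(top_words[m], top_words[l]) + 1) / denom)
    return score


# Hypothetical top words of a single topic, ranked by weight.
TOPIC = ["topic", "model", "document"]

for strip_plural in (False, True):
    tokenized = [preprocess(d, strip_plural=strip_plural) for d in DOCS]
    print(strip_plural, round(umass_coherence(TOPIC, tokenized), 3))
```

On this toy corpus the coherence is higher (closer to zero) when plurals are stripped, because "topic" and "topics" then count as the same word. In a fuller experiment the same loop would sit inside a grid search over LDA hyperparameters, with the top words taken from each fitted model.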

Citation

Citation: Krishna Kumar Chaturvedi., et al. “Evaluating Text Preprocessing Methods for Discovering Quality Topics to Improve the Information Retrieval Mechanism”. Acta Scientific Computer Sciences 5.9 (2023): 03-08.

Copyright

Copyright: © 2023 Krishna Kumar Chaturvedi., et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
