A stemming procedure and stopword list for general French corpora
Abstract
Due to the increasing use of network‐based systems, there is growing interest in accessing and searching text databases in languages other than English. To adapt search systems to languages whose characteristics resemble those of English, for the most part all that is needed is a general stopword list and a stemming procedure. This article presents the tools needed to establish both for French‐language databases, together with retrieval experiments carried out on two medium‐sized French‐language test collections. These experiments were conducted to evaluate the retrieval effectiveness of the proposed stopword list and stemming procedure.
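To make the two components concrete, here is a minimal Python sketch of how a stopword list and a stemming step typically fit into indexing. It is not the article's stopword list or stemming procedure: the `STOPWORDS` sample, the single plural-stripping rule in `stem`, and the function names are all assumptions chosen for illustration.

```python
import re

# Hypothetical, tiny sample of French function words (assumption);
# a general-purpose stopword list would be far longer.
STOPWORDS = {"le", "la", "les", "de", "des", "du", "un", "une",
             "et", "à", "en", "dans", "pour"}


def tokenize(text):
    """Lowercase the text and return runs of letters, accented ones included."""
    return re.findall(r"[^\W\d_]+", text.lower())


def stem(word):
    """Naive illustrative rule (assumption): drop a final plural 's' or 'x'
    from words longer than three letters. Real stemmers use many more rules."""
    if len(word) > 3 and word[-1] in "sx":
        return word[:-1]
    return word


def index_terms(text):
    """Remove stopwords first, then stem the remaining tokens."""
    return [stem(token) for token in tokenize(text) if token not in STOPWORDS]


if __name__ == "__main__":
    print(index_terms("Les systèmes de recherche dans les bases de données"))
```

With this sketch, the sample sentence reduces to `['système', 'recherche', 'base', 'donnée']`: the function words are dropped and the plural forms are conflated with their singulars, which is the effect a stopword list and a stemmer are meant to have on index terms.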




