Volume 50, Issue 10
Research

A stemming procedure and stopword list for general French corpora

Jacques Savoy

E-mail address: Jacques.Savoy@seco.unine.ch

Institut interfacultaire d'informatique, Université de Neuchâtel, Pierre‐à‐Mazel 7, CH ‐ 2000 Neuchâtel, Switzerland


Abstract

Due to the increasing use of network‐based systems, there is growing interest in access and search mechanisms for text databases in languages other than English. To adapt search systems to languages whose characteristics are similar to those of English, what is needed for the most part is a general stopword list and a stemming procedure. This article presents the tools needed to establish both for French‐language databases, together with retrieval experiments carried out on two medium‐sized French‐language test collections. These experiments were conducted to evaluate the retrieval effectiveness of the proposed stopword list and stemming procedure.
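As a rough illustration of the two components named in the abstract, the following Python sketch removes stopwords and then applies a few suffix-stripping rules to French tokens. It is only a minimal sketch: the stopword set, the rules, and the names `STOPWORDS`, `light_stem`, and `index_terms` are assumptions made for this example and are not the list or the procedure presented in the article.

```python
import re

# Hypothetical, deliberately tiny stopword set (NOT the article's list).
STOPWORDS = {"le", "la", "les", "de", "des", "du", "un", "une",
             "et", "en", "dans", "pour", "que", "qui"}


def light_stem(word: str) -> str:
    """Strip a few common plural/feminine endings (illustrative rules only)."""
    if len(word) > 4 and word.endswith("eaux"):
        return word[:-1]          # bureaux -> bureau
    if len(word) > 4 and word.endswith("aux"):
        return word[:-3] + "al"   # chevaux -> cheval
    if len(word) > 3 and word[-1] in "sx":
        word = word[:-1]          # villes -> ville
    if len(word) > 3 and word.endswith("e"):
        word = word[:-1]          # grande -> grand
    return word


def index_terms(text: str) -> list[str]:
    """Lowercase, tokenize, drop stopwords, then stem the remaining tokens."""
    tokens = re.findall(r"[a-zàâçéèêëîïôûùüÿœ]+", text.lower())
    return [light_stem(t) for t in tokens if t not in STOPWORDS]


print(index_terms("Les chevaux dans les grandes villes"))
# ['cheval', 'grand', 'vill']
```

Rules this crude will over-stem some words and miss many inflections; the sketch is meant only to show where a stopword list and a stemmer sit in an indexing pipeline.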

Number of times cited according to CrossRef: 54
