A knowledge-based approach to Information Extraction for semantic interoperability in the archaeology domain
Andreas Vlachidis
Faculty of Computing, Engineering and Science, University of South Wales, Pontypridd, Wales, CF37 1DL UK
Douglas Tudhope
Faculty of Computing, Engineering and Science, University of South Wales, Pontypridd, Wales, CF37 1DL UK
Abstract
The article presents a method for automatic semantic indexing of archaeological grey-literature reports using empirical (rule-based) Information Extraction techniques in combination with domain-specific knowledge organization systems. The semantic annotation system (OPTIMA) performs the tasks of Named Entity Recognition, Relation Extraction, Negation Detection, and Word-Sense Disambiguation using hand-crafted rules and terminological resources for associating contextual abstractions with classes of the standard ontology CIDOC Conceptual Reference Model (CRM) for cultural heritage and its archaeological extension, CRM-EH.
Relation Extraction (RE) performance benefits from a syntax-based definition of RE patterns derived from domain-oriented corpus analysis. The evaluation also shows a clear benefit from the use of assistive natural language processing (NLP) modules for Word-Sense Disambiguation, Negation Detection, and Noun Phrase Validation, together with controlled thesaurus expansion.
The semantic indexing results demonstrate the capacity of rule-based Information Extraction techniques to deliver interoperable semantic abstractions (semantic annotations) with respect to the CIDOC CRM and archaeological thesauri. Major contributions include the recognition of relevant entities using shallow parsing NLP techniques driven by a complementary use of ontological and terminological domain resources, and the empirical derivation of context-driven RE rules for the recognition of semantic relationships from phrases of unstructured text.
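As a rough illustration of the kind of rule-based annotation described above, the following minimal Python sketch looks up terms in a small gazetteer, associates each match with a CIDOC CRM class label, and flags matches that follow a negation cue. It is not the OPTIMA implementation, which the abstract describes only at a high level: the term list, the term-to-class assignments, the negation cues, the window size, and the function name `annotate` are all assumptions made for this example, and Relation Extraction and Word-Sense Disambiguation are left out.

```python
import re

# Illustrative term-to-class lookup. The terms and their class assignments are
# assumptions for this sketch, not OPTIMA's actual terminological resources.
GAZETTEER = {
    "pottery": "E19 Physical Object",
    "coin": "E19 Physical Object",
    "ditch": "E53 Place",
    "pit": "E53 Place",
    "roman": "E49 Time Appellation",
    "medieval": "E49 Time Appellation",
}

# Hypothetical negation cues, checked in a short window before each match.
NEGATION_CUES = ("no", "not", "without", "absence of")
WINDOW = 3  # number of preceding tokens inspected for a negation cue


def annotate(sentence):
    """Return (term, crm_class, negated) tuples for gazetteer terms in a sentence."""
    tokens = re.findall(r"[a-z]+", sentence.lower())
    annotations = []
    for i, token in enumerate(tokens):
        crm_class = GAZETTEER.get(token)
        if crm_class is None:
            continue
        # Join the preceding window so that multi-word cues can match as well.
        preceding = " ".join(tokens[max(0, i - WINDOW):i])
        negated = any(
            re.search(rf"\b{re.escape(cue)}\b", preceding)
            for cue in NEGATION_CUES
        )
        annotations.append((token, crm_class, negated))
    return annotations


if __name__ == "__main__":
    for annotation in annotate("No Roman pottery was found in the ditch."):
        print(annotation)
```

A fixed token window is the simplest possible scope rule for negation. A syntax-based pattern over noun phrases, as the abstract reports for Relation Extraction, would be the natural next refinement.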