CLEU: A Cross-Language English-Urdu Corpus and Benchmark for Text Reuse Experiments

Text reuse is becoming a serious issue in many fields and research shows that it is much harder to detect when it occurs across languages. The recent rise in multi-lingual content on the Web has increased cross-language text reuse to an unprecedented scale. Although researchers have proposed methods to detect it, one major drawback is the unavailability of large-scale gold standard evaluation resources built on real cases. To overcome this problem, we propose a cross-language sentence/passage level text reuse corpus for the English-Urdu language pair. The Cross-Language English-Urdu Corpus (CLEU) has source text in English whereas the derived text is in Urdu. It contains in total 3,235 sentence/passage pairs manually tagged into three categories, that is, near copy, paraphrased copy, and independently written. Further, as a second contribution, we evaluate the Translation plus Mono-lingual Analysis method using three sets of experiments on the proposed dataset to highlight its usefulness. Evaluation results (f1 = 0.732 for binary, f1 = 0.552 for ternary classification) indicate that it is harder to detect real cases of cross-language text reuse, especially when the language pairs have unrelated scripts. The corpus is a useful benchmark resource for the future development and assessment of cross-language text reuse detection systems for the English-Urdu language pair.


Introduction
Text reuse, the process of creating new texts using existing ones, has become very common because of free, readily available, and large digital repositories. In addition, state-of-the-art text processing applications have made it very simple to copy-paste text and give it a new identity. Text borrowed from such sources can be reused verbatim (copy-paste) or rewritten (paraphrased). If the rewriting process involves complex editing operations (e.g., lexical substitution, changes in syntax, summarization, synonym replacement, altering word order, or verb or noun nominalization) then the borrowed text transforms into an independently written piece (Clough, Gaizauskas, Piao, & Wilks, 2002; Maurer, Kappe, & Zaka, 2006). Moreover, new text can be created using text from one or more sources and the amount of reused text varies from local text reuse (such as a single word, small chunks, or sentences) to global text reuse (i.e., an entire document; Mittelbach, Lehmann, Rensing, & Steinmetz, 2010; Seo & Croft, 2008).
Unlike academic plagiarism (the unacknowledged reuse of text), text reuse is a common practice in journalism. Newspapers pay news agencies for their text(s) (here termed source text) to generate news stories (termed derived text). The text purchased from a news agency can be reused "verbatim" or "paraphrased" to create the newspaper story. However, at times the newspaper story might also be independently written without using any news agency text (Clough, 2010).
Text reuse can either be mono-lingual (when the source and derived text share the same language) or cross-lingual (when the source text is in one language and the derived text is in another). Mono-lingual text reuse detection has been a subject of intense study by the research community for some time, but recently the focus has shifted towards detecting text reuse across languages (Ceska, Toman, & Jezek, 2008; Franco-Salvador, Gupta, Rosso, & Banchs, 2016; Gupta, Barrón-Cedeño, & Rosso, 2012). A recent study suggested that the scale of cross-language text reuse and plagiarism is increasing (Barrón-Cedeño, Gupta, & Rosso, 2013). This is because of the following reasons: (a) users of under-resourced languages, which are very large in number, commonly use text(s) from resource-rich languages, (b) speakers of one language staying in a country other than their own can consult the text(s) in their native language, and (c) often speakers of one language are keen to write in a foreign language. Likewise, the recent rise in multi-linguality, freely available machine translation systems, and intelligent word processors are contributing to an environment where it is easy to reuse text across languages, but with a perception of being harder to detect such reuse (Somers, Gaspari, & Niño, 2006). Therefore, there is an ever-increasing necessity to develop standard evaluation resources and methods to detect cross-language text reuse for the various language pairs. To develop, evaluate, and analyse methods for cross-language text reuse (either local or global), gold standard benchmark corpora are needed. These corpora can be generated in three ways: (a) artificial - using an automatic text altering tool, (b) simulated - humans are asked to rewrite source text to create new text, and (c) real - a news agency's text is reused by journalists to create the newspaper story.
It seems likely that cross-language text reuse detection methods trained on real examples will give more realistic performance, a hypothesis we investigate further in this paper.
This study aims to develop a publicly available large-scale benchmark corpus that contains real examples of cross-language text reuse at sentence/passage level 1 for the English-Urdu language pair. Urdu belongs to the Indo-Aryan family, widely spoken in Pakistan and the northern parts of India (Alam, Mehmood, & Nelson, 2015). Moreover, it has a strong Perso-Arabic influence in its vocabulary and is written in a Perso-Arabic script from right to left. It is also spoken world-wide because of the South Asian Diaspora (with large populations in the Middle East, United States, UK, Norway, and Canada etc.; Daud, Khan, & Che, 2016). Despite that, for the English-Urdu language pair, there are no publicly available cross-language text reuse detection datasets known to us. Moreover, previous research has tended to focus more on European languages.
The corpus developed as an outcome of this study contains 3,235 pairs of real examples of cross-language text reuse at sentence/passage level (the source text is in English whereas derived text is in Urdu). Each sentence/passage pair is categorised as i) Near Copy (NC; 751 pairs), ii) Paraphrased Copy (PC; 1751 pairs), or iii) Independently Written (IW; 733 pairs). The corpus is representative enough to serve as a benchmark dataset for: (a) developing and evaluating techniques for cross-language text reuse detection for the English-Urdu language pair, (b) obtaining an insight into what edit operations are likely used by journalists in reusing text, and (c) to foster text reuse detection research in the English-Urdu language pair.
The remainder of this article is organized as follows. We first review previously developed cross-lingual text reuse or plagiarism detection corpora. Then we present a detailed discussion on the CLEU corpus construction, its statistics, characteristics, linguistic analysis, and example cases. This is followed by the explanation of crosslanguage text reuse detection experiments that we performed on our corpus to highlight its strengths and its utility for evaluation purposes. Finally, we present the results and their analysis and then conclude the article.

Related Work
In the previous literature, efforts have been made to develop standard evaluation resources for measuring cross-language text reuse (and plagiarism) for different language pairs. For example, PAN 2 authors have developed a series of corpora with artificial and simulated examples of plagiarism at document level (Potthast, Barrón-Cedeño, Eiselt, Stein, & Rosso, 2010; Potthast, Eiselt, Barrón-Cedeño, Stein, & Rosso, 2011; Potthast et al., 2012-2014; Stein, Rosso, Stamatatos, Koppel, & Agirre, 2009). The majority (90%) of the text plagiarism cases in these corpora are mono-lingual, however, there exists a small portion (10%) of cross-lingual plagiarism cases too. These cross-language plagiarism cases are for the English-German and English-Spanish language pairs. Most of these cases are artificial (created using an automatic MT [Machine Translation] system, that is, Google Translate 3) but a small number of them are created manually (i.e., translated by humans). These corpora have been used to evaluate text plagiarism detection methods in the competitions held annually.
The CL!TR 4 (Cross-Language Indian Text Reuse) corpus is the first of its kind developed specifically for the analysis of cross-language text reuse detection in the Hindi-English language pair at document level (Barrón-Cedeño, Rosso, Devi, Clough, & Stevenson, 2013). The suspicious documents it contains are in Hindi and the source documents in English language. The training set includes 198 suspicious (Hindi) and 5,032 source (English) documents, whereas the test set has 190 suspicious (Hindi) and 5,032 source (English) documents. The CL!TR corpus contains simulated cases of text reuse. The volunteers involved in the study were asked to answer a set of 10 questions, related to the tourism and computer science domains, to create suspicious documents. It contains three types of revisions, categorized by the amount of obfuscation used, namely "Exact" (without any modifications, translation only), "Light" (very few modifications, translation, and manual correction), and "Heavy" (detailed modifications, translation, and manual correction). The corpus also contains "Original" (independently written) documents which were generated without referring to the source documents but using the learning material provided.
Another cross-language corpus of 110 documents (55 source documents in English and 55 plagiarised documents in Bangla) contains simulated plagiarism cases and was built using students' reports from a university (Arefin, Morimoto, & Sharif, 2013). Two groups of 55 students each were asked to write a report on a given topic. Fifty reports were used as the training set whereas the remaining 10 formed the test set. Plagiarism cases were obfuscated by replacing contents with several plagiarized fragments of different lengths. However, the corpus is not available to download.
Recently, a cross-language (Urdu-English language pair) document level plagiarism detection corpus was submitted for the PAN 2016 shared task (Hanif et al., 2015). The corpus is divided into two sets, 500 source (Urdu) and 500 suspicious (English) documents, and contains only simulated examples of plagiarism. The source documents are Wikipedia excerpts whereas the plagiarized documents were manually created by university students. The students were asked to plagiarize 270 documents at three levels of obfuscation ("Near Copy," "Light Revision," and "Heavy Revision"), whereas 230 documents in the corpus are "Non-plagiarized." Moreover, the plagiarism cases inserted in the suspicious documents are of various lengths, that is, small (< 50 tokens), medium (50-100 tokens), and large (100-200 tokens). The corpus is the first cross-language (Urdu-English pair) dataset created for plagiarism detection research at the document level.
CLiPA (Cross-Language Plagiarism Analysis) is a publicly available fragment or sentence level corpus containing five source sentences (in English) which were used to generate plagiarized cases (in Spanish and Italian) using both machine translation (artificial) and manual translation (simulated; Barrón-Cedeño, Rosso, Pinto, & Juan, 2008). The machine translation cases were generated using five different services to obtain variation, whereas for the manually (human) simulated plagiarism cases, nine volunteers were asked to plagiarize each of the five source fragments. They were further requested to generate the same number of non-plagiarized cases as well. The corpus was used in experiments on text plagiarism detection research in the English-Spanish and English-Italian language pairs. In summary, the corpora discussed above contain either artificial or simulated examples of cross-language text reuse (or plagiarism). Cross-language text reuse detection methods developed using these non-real types of text reuse are unlikely to perform well on real cases of text reuse that occur in real world scenarios (e.g., academia, journalism; Weber-Wulff, 2010).
Moreover, the simulated cases created in a controlled environment using crowd-sourcing do not represent the strategies used by humans when rewriting text in real life. Because cross-language text reuse is increasing day-by-day, first, there is an urgent need to develop text reuse detection corpora with real examples of text reuse. Second, the available corpora for research are created at document level and there are no corpora available at sentence/passage level for the English-Urdu language pair. Last, the corpora listed above are not large enough to generate robust results. This is not surprising because it takes a lot of manual effort to create corpora with simulated examples of text reuse or plagiarism.
To develop and evaluate cross-language text reuse detection methods for the real-world scenario, we need to create corpora with real examples of text reuse. To fill this gap, our research work proposes a large-scale gold standard benchmark corpus containing real examples to measure cross-language text reuse at sentence/passage level for the English-Urdu language pair. The next section describes the corpus generation process in detail.

Extracting Sentence/Passage Pairs
To construct our corpus, we first collected a dump of 1,800 "related" news archives on a wide range of topics from the domain of journalism. Half of them (900) were news agency articles written in English, whereas the remaining half (900) were Urdu news stories published in the popular Urdu newspapers of Pakistan. It is a common practice in the newspaper industry that the news text released by news agencies is directly used by the newspapers verbatim or adapted by rephrasing it (Clough, 2012). Each of the collected news agency articles (English text) had a one-to-one mapping with the newspaper story (Urdu text), but as practiced in journalism, the newspaper story may or may not contain text from the news agency article. We targeted two well-known English news agencies, namely the Associated Press of Pakistan (APP) and the Independent News Agency (INP), and collected English news articles released by these agencies on a range of topics including sports, politics, business, showbiz, technology, local, and foreign news. The Urdu newspaper stories were extracted from the top five large-circulation national dailies, that is, Daily Jang, Daily Dunya, Express, Daily Aaj, and Nawa-e-Waqt, for 6 months from September 2015 to February 2016. The news text collection was carried out throughout each month excluding the public holidays on which either the newspaper was not published, or the news agency did not provide the service.
In the next step, from the set of 1,800 related English-Urdu news articles, a total of 3,235 sentence/passage pairs were manually extracted by one of the authors (who is a native speaker of Urdu, with high proficiency in English, and has expertise in the field of text reuse and plagiarism detection). The sentence/passage pairs extracted from the news agency articles (in English) were considered as "source" text whereas the newspaper stories (in Urdu) were "derived" text.

Annotation Guidelines
After extracting the related sentence/passage pairs, the next step was to manually classify each pair into one of the three categories that is, (a) Near Copy (NC), (b) Paraphrased Copy (PC), or (c) Independently Written (IW), depending upon the relationship between them. To classify a pair into one of the three categories, the following guidelines were prepared: Near Copy: A sentence/passage pair will be tagged as "NC" if the reused text is almost an exact translation of the source text. However, because of the cross-language setting, small alterations in the reused text will be ignored. Additionally, a small amount of new text may also appear in the reused text because of the structural difference in both languages (see section Examples from the Corpus for NC example).
Paraphrased Copy: For a pair to be tagged as "PC," its contents must be semantically the same, describing the same information. However, the reused text must not be a mere translation of the source text. Rather, the source text should be paraphrased using different text editing operations including word/sentence re-ordering, merging/splitting of sentences, insertion/deletion of new text, replacing words/phrases with appropriate synonyms, expansion/compression of text, etc. (see section Examples from the Corpus for a PC example).
Independently Written: To tag a sentence/passage pair as "IW," the context of the news should be the same or both texts must be describing the same event. However, the reused text must not be borrowed from the news agency text (although there may be individual words that co-occur). Moreover, a lot more new information could be present in the reused text with completely different facts and figures (see section Examples from the Corpus for IW example).

Annotations
With these guidelines in mind, each sentence/passage pair was then manually tagged into one of the three classes (i.e., NC, PC, or IW) by three annotators (X, Y, and Z). The annotators were post-graduate NLP (Natural Language Processing) students, native Urdu speakers with a high level of proficiency in English language. Furthermore, they were provided with training about the journalistic text reuse phenomena and with tutorials on different text rewriting operations by a linguist. The purpose of training was to comprehend the concepts, particularly, related to different levels of text reuse. The annotations began with two annotators, annotator X and annotator Y. Both were given a task to annotate a small subset of 100 sentence/passage pairs. The resulting annotations of this subset were discussed to further refine the annotation guidelines and the results were saved. Next, the remaining 3,135 sentence/passage pairs in the corpus were annotated using the final guidelines presented above and inter-annotator agreement was computed on the entire corpus. The conflicting pairs were annotated by annotator Z.
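The article reports that inter-annotator agreement was computed on the entire corpus but does not spell out the measure. For two annotators assigning categorical labels, one common choice is Cohen's kappa, sketched below; this is an illustrative assumption on our part, not the measure the authors necessarily used, and the example labels are made up rather than drawn from the corpus:

```python
def cohens_kappa(a, b):
    """Cohen's kappa for two annotators labelling the same items:
    (observed agreement - chance agreement) / (1 - chance agreement)."""
    assert len(a) == len(b) and len(a) > 0
    n = len(a)
    # Observed agreement: fraction of items both annotators labelled the same.
    po = sum(1 for x, y in zip(a, b) if x == y) / n
    # Chance agreement: product of each annotator's marginal label frequencies.
    labels = set(a) | set(b)
    pe = sum((a.count(c) / n) * (b.count(c) / n) for c in labels)
    return 1.0 if pe == 1 else (po - pe) / (1 - pe)

# Hypothetical annotations over four sentence/passage pairs.
x = ["NC", "PC", "PC", "IW"]
y = ["NC", "PC", "IW", "IW"]
kappa = cohens_kappa(x, y)
```

In this toy example the annotators agree on 3 of 4 items, giving a kappa of about 0.64 once chance agreement is factored out.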
As shown in Table 1, most conflicts arose in discriminating between PC and IW sentence/passage pairs (199) because it is hard to distinguish substantially paraphrased text from independently written text. The number of conflicts between the NC and PC classes is also considerable (109). Potentially, this is because of the small text fragments and the cross-language setting, which might have made it difficult for annotators to distinguish between the NC and PC classes. However, there are only two conflicts between the NC and IW classes, which means that it was relatively easy for annotators to distinguish between these two classes. The 310 conflicts were resolved by annotator Z, who favored annotator X 143 times (46%) and annotator Y 165 times (53%). Moreover, the two conflicts between NC and IW were resolved as PC.

Corpus Statistics
Our benchmark cross-language text reuse detection corpus contains 3,235 source-derived sentence/passage pairs. More than half of the pairs in the corpus belong to the PC category (54.12%). The NC (23.21%) and IW (22.65%) pairs are of almost equal proportions. Detailed statistics can be found in Table 2. The corpus is freely available for research purposes.

Examples From the Corpus
In this section, we present representative NC, PC, and IW sentence/passage pair examples from our corpus. 6 The first two examples show a Near Copy sentence pair and a passage pair from the proposed corpus. As can be noted, the reused text is almost an exact translation of the source text. Moreover, in both cases the order of information is also preserved.

NC sentence pair example Source
The chief minister said he would personally monitor the programme of repair and construction of roads in rural areas and review the pace of progress on fortnightly basis. Reused

Reused-translation
The chief minister said, "[I] will personally monitor the program of the construction and repair of roads in rural areas and will review the progress after every fifteen days."

NC passage pair example Source
The Chief Minister directed that the program of Google mapping of the land be extended to the whole province and an effective awareness campaign be initiated regarding the importance of the project. He appreciated concerned institutions and authorities over speedy progress on the project and said that it was a publicwelfare project and would directly benefit the masses.

Reused-translation
The Chief Minister directed that the scope of Google mapping program of the land be extended to the entire Punjab and effective awareness campaign be launched to highlight the importance of the project. The Chief Minister also praised the relevant institutions and authorities on the rapid progress of the project and said that it is an extremely important project of public interest, which would directly benefit the masses. The next two examples show cross-language Paraphrased Copy sentence and passage pairs from the corpus.
In both examples, the reused text is either a compressed or an expanded version of the source text. Although the wording differs, both texts convey the same information. It can also be noted that words (or phrases) have been reordered to generate the reused text. In the first example, the source text has only one sentence whereas the derived text has been split into two sentences; in the second example, the source passage comprises two sentences whereas the reused text has been reduced to only one sentence. Furthermore, some words have been replaced with appropriate synonyms. For instance, "king" is replaced with "Shah Abdullah," and "Medical City" with "Medical Center" in the reused text. The second example also exhibits a temporal change, as the age of the person, mentioned as "around 90" in the source text, is replaced with "91" in the derived text. These transformations highlight the fact that complex rewriting operations have been used by the journalists. However, while creating the derived text from the source text, the meaning of the source text has been preserved.
PC sentence pair example Source Prime Minister Nawaz Sharif recalled that in start of December he had announced reduction of 2.32 rupees per unit in prices of electricity, but the announcement was deferred because of Peshawar school tragedy. Reused

Reused-translation
Prime Minister said that the price of electricity has been decreased by 2 rupees 32 paisas per unit. In the coming days [we] are trying to make electricity more affordable. The final two examples are the cross-language Independently Written sentence and passage pairs. It can be noted that both source and derived texts are describing the same news. However, the facts, figures, or contents are altered, and the way of expression is entirely different. In addition, some new (but related) information is also added in the reused text which is not present in the source text. This shows that reused text is generated independently of the source text and any overlap of words is very low between the text pair.
IW sentence pair example Source Prime Minister Muhammad Nawaz Sharif Wednesday announced further reduction in the prices of petroleum products, which will be effective from January 1, 2015. Reused

Reused-translation
Prime Minister Nawaz Sharif has announced reduction in prices of petroleum products up to Rs. 14 per liter. Petrol 6.25, diesel 7.86, kerosene oil 11.26 rupees per liter cheaper [than before].

IW passage pair example Source
The sources said that two inmates of a house in the precincts of PS Brewery died because of suffocation caused by the gas leakage. The victims were shifted to Bolan Medical College Hospital and later their bodies were handed over to the heirs. Further probe was in progress.

Reused
Reused-translation According to rescue sources, the incident took place in Hazara Town and those in one room of the house left the gas heater on and slept. Because of the gas filling in the room two real brothers died of suffocation while sleeping.

Linguistic Analysis of the Transformations
Further, to highlight the breadth of reuse in our corpus, we carried out a linguistic analysis by investigating the most common edit operations (or transformations) that have been used to generate the reused text (in Urdu) from the source text (in English). For our study, we explored ten different types of transformations (following the taxonomy used by Clough, 2012), which are described, with examples, as follows. Temporal change (time/date change): This involves changing a time or date from the source text to create the new text (e.g., today → yesterday, 1998 → last year, earlier in this year → March, etc.). Below is an example from the corpus that shows the use of the word " " (today) to replace the day/date "Tuesday, 30 Dec." Trade bodies announced to close the city markets to mourn the sad incident on Tuesday, Dec 30.
Change of tense: This involves creating the new text by making a shift in tense. For instance, present tense can be changed to past tense (e.g., he is going → he went). The example that follows shows a source sentence in the past tense that has been rewritten in the present tense.
They were awarded death sentence over insufficient proof.
Use of pronoun for proper noun: While reusing text, a proper noun can be changed to a pronoun and vice versa (e.g., John Walker → he). The following example from our corpus exhibits the use of a proper noun " " (Nawaz Sharif) in the derived text. This is part of my vision of a developed KPK and a real Naya Pakistan, he said.
Change in phrase order: In this type of transformation, the order of words or phrases in a text is changed when reusing it. (In all the examples, words are underlined to highlight them.) The corpus has plenty of such examples and one of them is shown below (because Urdu is written RTL, the underlined phrase in the Urdu text is actually the last phrase in the sentence).
No further information was immediately available, and the cause of the incident was being investigated.
Addition/deletion of text: This involves adding new information or deleting text in the source fragment to create the derived one. The following example shows addition of new information " " (held from Pakistan) at the end of derived text.

US transfers five Guantanamo detainees to Kazakhstan.
Change of voice: Sometimes the source text is converted from active to passive voice (or vice versa) to create the new text. We show here one such example from our corpus.
Director General Inter-Services Intelligence (ISI) Lt. General Rizwan Akhtar on Saturday called on Prime Minister Muhammad Nawaz Sharif at the Prime Minister's House.
Change in name/title reference: This type of change occurs when a title/designation is replaced with the actual name or vice versa (e.g., John Smith → director). Below is an example in which "PM" (Prime Minister) is replaced by " " (Nawaz Sharif) in the derived text.
PM orders to cut power tariff by Rs 2.32.
Direct to indirect: This type of transformation occurs when a text is converted from direct to indirect speech or vice versa (e.g., Ali said, "I am writing a letter now." → Ali said that he was writing a letter then.). Below is an example from our corpus which shows a sentence changed from direct to indirect speech.
"They must tell the players their requirement from the players," he said.

Change in expression:
This type involves the replacement of exact figures with estimations or rewordings (e.g., 20% → one in five). The example that follows shows the injured persons figure of "40" replaced with " " (several injured) in the derived text.
Yemen suicide bomber kills 33 at Shiite celebration, over 40 injured.
Synonym substitution: This type exhibits the replacement of a word or phrase with an appropriate synonym. The following example shows a word "embarrassment" and a phrase "wrong facts and figures" in the source text substituted with " " (humiliation) and " " (erratic), in the derived text, respectively. Imran Khan presented wrong facts and figures in his recent news conference addressed at Banigala to avoid embarrassment.
To investigate the usage of the above transformations in our CLEU corpus, we took a random subset of 189 pairs (63 NC, 63 PC, and 63 IW) from the corpus. 8 We manually computed the frequencies of the 10 transformations in this subset. Note that because of the large size of the corpus, it would have been very difficult and time consuming to compute these frequencies for the entire corpus; therefore, a reasonable subset is used. 9 Table 3 shows the absolute (fr_abs) and relative (fr_rel) frequencies of transformations in the sets of 63 Near Copy, 63 Paraphrased Copy, and 63 Independently Written pairs. For Near Copy pairs, the most common edit operation is "Synonym substitution" (0.263) followed by "Temporal change" (0.210) and "Direct to indirect" (0.210). As expected, four of the transformations have zero frequency. This demonstrates that when sentence/passage pairs have a Near Copy relation, they are almost exact copies of each other and very few edit operations are involved.
For Paraphrased Copy pairs, "Addition/deletion of text" is the most common edit operation (0.268), followed by "Synonym substitution" (0.194) and "Change in phrase order" (0.127). This demonstrates that these three are the favorite text altering operations used by journalists in rewriting a newspaper story using the news agency's text. Although the other transformations have relatively low frequencies, almost all of the edit operations are used in creating derived text (total = 149). This means that complex rewriting has been used to create the Paraphrased Copy text.
Finally, the absolute and relative frequencies of edit operations for the Independently Written pairs show that the "Addition/deletion of text" transformation has the highest frequency (0.385), as with the Paraphrased Copy pairs. However, this relative frequency is much higher than that of the second best, "Change of tense" (0.117). This shows that in Independently Written pairs information is frequently added or deleted. Moreover, it further highlights the fact that the reused text is independently written without borrowing text from the news agency's copy.

Text Reuse Detection Experiments
In this section, we describe the cross-language text reuse detection experiments that we performed on our corpus. A recent study (Barrón-Cedeño, Gupta, & Rosso, 2013) compared different text similarity estimation methods and showed that Translation + Monolingual Analysis (T+MA) outperformed others. Therefore, we also applied the same approach to our corpus.

Translation Plus Monolingual Analysis
In T+MA, the overall approach is to translate the derived text to the language of the source text and then apply state-of-the-art mono-lingual analysis methods. Therefore, to start with, we translated the derived text of all sentence/passage pairs from Urdu to English using an MT system, that is, Google Translate. In the second phase, we applied several mono-lingual text similarity estimation methods, such as Word n-gram overlap, Longest Common Subsequence, and Greedy String Tiling, to our corpus.
The Word n-gram overlap method estimates the amount of common n-grams between the source and derived texts. It has been observed that overlap of longer chunks of n-grams indicates potential text reuse (Adeel Nawab, Stevenson, & Clough, 2012; Clough & Stevenson, 2011). It is one of the simplest methods used in text reuse detection and can easily be applied to a large collection of texts because of its low complexity. As our corpus contains sentence/passage level data (smaller units than a document), we computed the scores for lengths of n from 1 to 3 (i.e., Unigram, Bigram, and Trigram; see the Results and Discussion section) using the overlap similarity measure (Manning & Schütze, 1999). Moreover, we merged the set of all three features (i.e., Unigram, Bigram, and Trigram) as a combined input (Combined; see the Results and Discussion section) to the classification task.
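As a minimal sketch of this feature, the overlap similarity over word n-gram sets can be computed as follows. We assume simple whitespace tokenisation and lowercasing here; the paper's exact preprocessing may differ:

```python
def ngrams(tokens, n):
    """Set of word n-grams (as tuples) occurring in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def overlap_score(source, derived, n):
    """Overlap similarity between the n-gram sets of two texts:
    |A ∩ B| / min(|A|, |B|)."""
    a = ngrams(source.lower().split(), n)
    b = ngrams(derived.lower().split(), n)
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))
```

The Unigram, Bigram, and Trigram features correspond to n = 1, 2, 3, and the Combined setting simply supplies all three scores to the classifier together.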
Two other text reuse detection methods based on string matching algorithms are the Longest Common Subsequence (LCS) and Greedy String Tiling (GST). LCS is typically used for comparing files, whereas GST was specifically proposed for plagiarism detection in texts (Wise, 1993). The Longest Common Subsequence of two texts, X and Y, is the longest group of elements that are common to both and appear in the same order in each text. For example, the sequences "ABCDGH" and "AEDFHR" have an LCS of "ADH." For each source and derived sentence/passage pair, we computed the normalized Longest Common Subsequence score (LCSnorm; see the Results and Discussion section) by dividing the length of the LCS by the length of the shorter text.
Greedy String Tiling identifies the longest matching sequences of substrings between the source and derived texts and returns them as tiles, subject to a minimum match length (mml) value. It is a powerful algorithm as it can detect matches even if some of the text is deleted or additional text has been inserted. We used the well-known Running Karp-Rabin Matching and Greedy String Tiling (RKR-GST; Wise, 1993) implementation and experimented with mml lengths of [1-3] (i.e., GST-mml1, GST-mml2, and GST-mml3; see the Results and Discussion section).
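The tiling idea can be illustrated with a naive quadratic version; the published RKR-GST speeds up the search for maximal matches with Karp-Rabin hashing, and the normalization by the shorter text is an assumption of this sketch:

```python
def gst_score(source, derived, mml=1):
    """Simplified Greedy String Tiling at word level: repeatedly take the
    longest common substring of unmarked tokens as a tile (length >= mml),
    then normalize the total tiled length by the shorter text."""
    s, d = source.split(), derived.split()
    if not s or not d:
        return 0.0
    marked_s, marked_d = [False] * len(s), [False] * len(d)
    tiled = 0
    while True:
        best = (0, -1, -1)  # (match length, start in s, start in d)
        for i in range(len(s)):
            for j in range(len(d)):
                k = 0
                while (i + k < len(s) and j + k < len(d)
                       and s[i + k] == d[j + k]
                       and not marked_s[i + k] and not marked_d[j + k]):
                    k += 1
                if k > best[0]:
                    best = (k, i, j)
        length, i, j = best
        if length < max(mml, 1):
            break  # no remaining match reaches the minimum match length
        for k in range(length):  # mark the tile so it cannot be reused
            marked_s[i + k] = marked_d[j + k] = True
        tiled += length
    return tiled / min(len(s), len(d))
```

Because matched tokens are marked, insertions and deletions between tiles do not prevent the surrounding matches from being found, which is the property the paragraph above describes.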

Experimental Setup
In this section, we describe the set of experiments performed on our corpus. To estimate the similarity between source and derived text pairs, we designed three sets of experiments (i.e., Exp1, Exp2, and Exp3). For Exp1, we used the source and derived texts without any preprocessing. For Exp2, we excluded all the stop words 10 from both source and derived texts before running the experiment. For Exp3, after removing stop words, we further stemmed the source and derived texts using the Snowball stemmer (Porter, 2001). The reason for including Exp2 and Exp3 was to analyze the effect of preprocessing on text reuse detection experiments. For all three experiments, the whole corpus (i.e., 751 NC, 1,751 PC, and 733 IW pairs) was used. Moreover, we investigated both binary and ternary classification settings for each experiment. For the experiments using binary classification, the NC and PC sentence/passage pairs were combined to form the derived (D) class, whereas the IW pairs were used as the non-derived (ND) class. For both (binary and ternary) supervised classification tasks, the Random Forest classifier, 11 with 10-fold cross-validation, implemented in the WEKA 12 machine learning platform, was used. To evaluate the performance, results are reported using weighted average precision (p), recall (r), and the harmonic mean of both, that is, the F1-measure (f1).
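The setup can be sketched in Python with scikit-learn standing in for WEKA; the random feature matrix below is a hypothetical placeholder for the real similarity scores, so the sketch shows only the shape of the pipeline, not the reported numbers:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_recall_fscore_support
from sklearn.model_selection import StratifiedKFold, cross_val_predict

rng = np.random.default_rng(0)
X = rng.random((300, 3))          # placeholder features, e.g., Unigram/Bigram/Trigram scores
y = rng.integers(0, 3, size=300)  # ternary labels: 0=NC, 1=PC, 2=IW
y_bin = (y != 2).astype(int)      # binary: NC+PC -> derived (1), IW -> non-derived (0)

clf = RandomForestClassifier(random_state=0)
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
pred = cross_val_predict(clf, X, y_bin, cv=cv)  # 10-fold cross-validation
p, r, f1, _ = precision_recall_fscore_support(y_bin, pred, average="weighted")
```

The ternary setting differs only in passing `y` instead of `y_bin`; the weighted average matches the reporting scheme described above.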

Results and Discussion
In this section, we discuss the results obtained as an outcome of the experiments performed on the whole corpus. The aim is to distinguish between multiple levels of text reuse at sentence/passage level. The evaluation results obtained using Exp1, Exp2, and Exp3 are shown in Tables 4, 5, and 6, respectively.
Consistent with expectations, across all three experiments the results are higher under binary classification settings. This indicates that in text reuse detection experiments it is comparatively easier to differentiate between two classes than three (see Figures 1 and 2). Overall, Unigram and GST-mml1 performed better than LCS for Exp2 and Exp3, and Combined and Unigram performed better for Exp1. This is consistent with the METER study (Clough, Gaizauskas, Piao, & Wilks, 2002) and our recent experiments on Urdu text reuse detection (Sharjeel, Nawab, & Rayson, 2017), which also highlighted that Word n-gram overlap and GST are the most appropriate methods for the problem of text reuse detection. Furthermore, it can also be noticed that the results for higher values of n (in Word n-gram overlap) and mml (in GST) are low. This shows that when the text is heavily derived from the source, using smaller lengths of lexical units (words) is more effective in detection experiments. Moreover, the effect of preprocessing (stop word removal and stemming) on text reuse detection is visible in Exp2 and Exp3, where results are slightly better than in Exp1 (see Figures 1 and 2). This corroborates previous studies on text normalization and its effects on text reuse and plagiarism detection experiments (Ceska & Fox, 2009; Chong & Specia, 2011). Last, the high error rate of the low-quality MT for the Urdu-English language pair has resulted in relatively lower precision in all three experiments.
For Exp1 (see Table 4), Word n-gram overlap and the combination of Unigram, Bigram, and Trigram (Combined) performed better than the more complex LCS and GST methods. For binary classification, the best results are obtained using the Combined approach (f1=0.711), whereas for ternary classification they are obtained using Unigram (f1=0.493). This highlights that n-gram overlap with smaller values of n can better capture the overlap of text between source and derived sentence/passage pairs. Moreover, the classifier is better suited to the binary classification problem when using a combination of features rather than a single one, and it produced significantly better results.
For Exp2 (see Table 5), the GST method reported comparatively better results. For both classification settings, the highest results are obtained using mml1 (binary f1=0.735, ternary f1=0.549). As the stop words are removed from the text, GST can find longer chunks in the paraphrased text, which results in its high performance. Word n-gram overlap performed second best (binary f1=0.715, ternary f1=0.528) for the smallest value of n, but its performance dropped with increasing length of n. This shows that even when stop words are removed, small values of n produce the best results in text reuse detection experiments on our corpus. On the other hand, LCS reported lower scores as it suffers from the block move problem (Wise, 1993).

FIG. 2. Weighted average f1 results plots for ternary classification.
Interestingly, for Exp3 (see Table 6), Word n-gram overlap scored highest in binary classification (f1=0.732) whereas GST is best in ternary classification (f1=0.552). Again, the best results are with the smallest value of n (in Word n-gram overlap) and mml1 (in GST). This further supports our statement that in the text reuse detection problem, even after applying preprocessing, selecting a smaller length of tokens is more effective. Moreover, the preprocessing of text (stop word removal and stemming) has an impact on the detection of text reuse, as can be seen from the results in Table 6, which are significantly better than those in Table 4. This improvement is statistically significant as tested with the Wilcoxon signed-rank test (p < .05; Wilcoxon, Katti, & Wilcox, 1973).
Overall, the cross-language text reuse detection results obtained on our corpus (best f1=0.552) are comparatively lower than those for the METER corpus (best f1=0.664), which is a gold standard mono-lingual text reuse detection corpus (Clough, Gaizauskas, Piao, & Wilks, 2002). Moreover, the results are also low when compared with the CL!TR corpus (best f1=0.600), which contains cross-language text reuse cases for the English-Hindi language pair (Barrón-Cedeño, Rosso, Devi, Clough, & Stevenson, 2013). 13 The rationale is that our corpus contains real examples of paraphrased text, whereas CL!TR is a simulated corpus, manually created by volunteers in a controlled environment who were allowed to use online automatic tools to translate the text (from English to Hindi) and then modify it. On the other hand, METER is a mono-lingual corpus (both source and suspicious texts are in English), but the problem becomes much harder when the source and derived texts do not share the same language (Barrón-Cedeño, Rosso, Agirre, & Labaka, 2010). Our study indicates that it is challenging to detect real cases of text reuse, especially when they occur across languages. Furthermore, it becomes much more challenging when the languages involved are written in unrelated scripts (e.g., English and Urdu), as finding similarity between two sentences that have disparate structure is much harder. Unlike English, the Urdu language has a very complicated morphological structure, where a word can have up to sixty different forms (Rizvi & Hussain, 2005). Therefore, when text is reused across cross-script languages, it becomes hard to detect even the exact copy pairs because of the unique characteristics, features, and structure of the different languages.
Both of the corpora discussed above are annotated at the document level, and T+MA suffers on short cases of reuse (Barrón-Cedeño, Gupta, & Rosso, 2013), which has resulted in its low performance on our corpus. One of the key phases in the T+MA method is the language normalization step. During this phase, we observed that the available online MT systems are not of good quality; moreover, very few of them support Urdu-English machine translation. We noticed that even the translations of verbatim or exact copy pairs were far from perfect. This inherent flaw has also contributed to the low performance of the T+MA method.
To further support this observation, we randomly selected a sample of 300 sentence/passage pairs (100 from each class) and performed both machine and manual translations of the subset. Table 7 shows the comparison of T+MA results for both machine and manually translated examples. 14 It can be noticed that the results for the manually translated sentence/passage pairs are better than the machine translated ones in both binary and ternary classification tasks and across all experiments performed. This shows that the T+MA method could perform comparatively better provided it incorporates a good MT system. Table 8 shows the confusion matrix (columns=predicted, rows=actual instances) for the Exp3 ternary classification experiment using the GST-mml1 method, which reported the best result overall (see Table 6). The confusion matrix shows the misclassification of 432 NC and 510 IW pairs as PC. This makes PC the most problematic class for the three-class problem and indicates that it is difficult to discriminate between NC-PC and PC-IW pairs. Therefore, for ternary classification, the overall results were low.

14 We only report the best f1 results on the subset. The detailed results can be accessed at http://ucrel.lancs.ac.uk/textreuse/cleu.php

Conclusion and Future Work
Cross-language text reuse detection has recently gained a lot of interest because of the rapid increase in multilingual content readily available online. However, a major bottleneck for the development of state-of-the-art cross-language text reuse detection methods is the unavailability of standard evaluation resources containing real cases of text reuse, especially when one of the languages (e.g., Urdu) is highly under-resourced. In this paper, we addressed this issue by contributing the first cross-language text reuse corpus for the English-Urdu language pair with real examples of reuse. The corpus contains sentence/passage pairs collected from the news domain and manually tagged into one of three classes (i.e., NC, PC, and IW) depending on the amount of adaptation. The source text comes from news agency articles in English, whereas the derived text, extracted from newspapers, is in Urdu. We described the linguistic properties of such text reuse for the first time and compared relative frequencies in a taxonomy of ten transformation types. In addition, we applied three sets of experiments using the popular T+MA cross-language text reuse detection method on the developed dataset. Evaluation results indicated that lower values of n (for Word n-gram overlap, best f1=0.732 binary classification) and mml1 (for GST, best f1=0.552 ternary classification) are the most fruitful in text reuse detection experiments on our corpus. Our results (best f1=0.552) also showed that the T+MA method applied cross-lingually is less accurate than in the mono-lingual case (METER corpus, best f1=0.664) and less accurate with real reuse compared with simulated reuse (CL!TR corpus, best f1=0.600). Furthermore, our experiments revealed that text preprocessing operations, that is, stop word removal and stemming, have a positive effect on cross-language text reuse detection.
In the future, we plan to apply other state-of-the-art cross-language text reuse detection methods that work on the semantics of the text rather than the surface level, to better capture the details of real cases of cross-language text reuse.