The Birth of Collective Memories: Analyzing Emerging Entities in Text Streams

We study how collective memories are formed online. We do so by tracking entities that emerge in public discourse, that is, in online text streams such as social media and news streams, before they are incorporated into Wikipedia, which, we argue, can be viewed as an online place for collective memory. By tracking how entities emerge in public discourse, i.e., the temporal patterns between their first mention in online text streams and subsequent incorporation into collective memory, we gain insights into how the collective remembrance process happens online. Specifically, we analyze nearly 80,000 entities as they emerge in online text streams before they are incorporated into Wikipedia. The online text streams we use for our analysis comprise social media and news streams, and span over 579 million documents in a timespan of 18 months. We discover two main emergence patterns. Some entities emerge in a "bursty" fashion, i.e., they appear in public discourse without precedent, blast into activity, and transition into collective memory. Other entities display a "delayed" pattern, where they appear in public discourse, experience a period of inactivity, and then resurface before transitioning into our cultural collective memory.


Introduction
Remembering is a social process (Halbwachs, 1950). Collective remembrance is the process in which information moves from public discourse into a shared collective memory. This process has been compared to the remembrance process of an individual, whose memories transfer from short-term into long-term memory (Assmann & Czaplicka, 1995). This comparison has been formalized by mapping the collective's equivalent of long-term and short-term memory to the cultural and communicative memory, respectively.
Cultural collective memory (CM), the collective's equivalent of an individual's long-term memory, is characterized by being organized, specialized, formal, structured, and distanced from the immediate (Assmann & Czaplicka, 1995). Wikipedia is known to "democratize information" through its collaborative nature: its content is produced by volunteer editors and authors from around the world (Wallace & Van Fleet, 2005). Wikipedia has been called an online place for cultural CM (Luyt, 2016; Pentzold, 2009). We support this view, and argue that the aforementioned characteristics fit Wikipedia's nature. First, Wikipedia is organized, through its hierarchy of contributors, where authors are distinguished from admins. Wikipedia is specialized, since appropriately citing relevant and expert sources to support and back up newly added information is a requirement. These conventions, requirements, and policies around contributing new information to Wikipedia impose a level of formality and enable its coherent and consistent structure. Finally, the requirement for new articles to be collectively deemed "important enough" ensures Wikipedia's distance from the immediate.
Communicative CM is in many aspects the opposite of cultural CM. Analogously to an individual's short-term memory, communicative CM is mainly orally negotiated, close to the everyday, disorganized, informal, and non-specialized (Assmann & Czaplicka, 1995). Online text streams fit this notion of orally negotiated memory: the rapid pace and high volume at which content is published by news websites and social media platforms means that, as opposed to the carefully curated and edited nature of Wikipedia, online text streams are close to the everyday: they not only record and reflect the actions of everyday life but also have a role in producing everyday life for a media-enabled public (Tierney, 2013, p. 33). With the advent of Web 2.0, and the ability for anyone to publish content on the web, online text streams have naturally become disorganized, informal, and non-specialized.
We study the evolution of collective memory by tracking additions to our online cultural CM, Wikipedia. Specifically, we study real-world entities as they emerge in online text streams, and are subsequently added to Wikipedia. Every day, new content is being added to Wikipedia, with the knowledge base receiving over 6 million monthly edits at its peak (Suh, Convertino, Chi, & Pirolli, 2009). Domain experts may find information missing on Wikipedia and take up the task of contributing this new information. Alternatively, new, previously unheard-of entities may emerge in news articles or social media postings that describe or comment on events, e.g., the Olympics may introduce new athletes onto the world stage, or the opening of a new restaurant may be reported in local news media and appear in social media. Studying entities that transition from public discourse into Wikipedia gives us insights into how collective memory evolves; for the first time, the online world allows us to make such observations at scale.
To study emerging entities, we analyze entities in a sample of online social media and news text streams spanning over 18 months. We focus on the entities' emergence patterns, i.e., how an entity's exposure evolves between its first mention in online text streams, and the moment it is added to Wikipedia. We define an entity's emergence pattern to be its "document mention time series," i.e., the time series that represents the number of documents that mention the entity per day, starting at its first mention in the stream, until it is incorporated into Wikipedia. An example time series is shown in Figure 1, with the number of documents that mention Curiosity on the y-axis (the emergence volume) and the time span between the entity's first mention in online text streams and the day it is added to Wikipedia on the x-axis (the emergence duration).
Figure 1: Emergence pattern of the entity Curiosity (Rover), first mentioned in our text stream in October 2011. The Wikipedia page for Curiosity was created nine months later, on August 6, 2012. There are two distinct bursts, one late November 2011, the second shortly before the entity is added to Wikipedia. The two bursts correspond to the Mars Rover's launch date (November 26, 2011) and its subsequent landing (August 6, 2012).
The main findings of this paper are as follows. By clustering entities' emergence patterns, we find two kinds of regularity: entities that show a strong early burst around the time of their introduction into public discourse, and late bursting entities that exhibit a more gradual emergence. Furthermore, we find meaningful differences between how entities emerge in social media and news streams: entities that emerge in social media tend to transition more slowly from communicative CM to cultural CM than those that emerge in news streams. Finally, we show how different entity types exhibit different emergence patterns; the fastest emerging entities are of types that have shorter life cycles, such as devices (e.g., smartphones) and "cultural artifacts" (e.g., movies and music albums).
Wikipedia was first dubbed a global memory place where collective memories are built by Pentzold (2009), with follow-up studies by Keegan (2011) and Ferron and Massa (2011). As Ferron (2012, p. 23) puts it, "Wikipedia's processes of discussion and article construction can be seen as the discursive formation of memory, or in other terms, as the transition from communicative memory, which is interactive, informal, nonspecialized, reciprocal, disorganized and unstable, to cultural memory, which is formal, well organized and objective" (our italics).
In the context of online collective memory, studies have revolved around automated methods for analyzing texts, e.g., studying temporal expressions in web documents has shown that we tend to remember the "near past" online (Au Yeung & Jatowt, 2011). Wikipedia viewership statistics have provided insights into how current events trigger remembrance patterns of past events (García-Gavilanes, Mollgaard, Tsvetkova, & Yasseri, 2017). Other sources used for online collective memory studies include search engine query logs (Campos, Dias, & Jorge, 2011) and microblog services (Jatowt, Antoine, Kawai, & Akiyama, 2015).
Our work differs from previous work on collective memory in two important ways. We are the first to empirically study the transition from communicative CM to cultural CM in terms of the entities that are mentioned in news and social text streams, before being included in Wikipedia. And we are the first to empirically study this transition at scale and across text streams and entity types, signifying an important difference from case studies that involve dramatic or traumatizing events, characteristic of the study of "collective memory" (Lipsitz, 2001; Neal, 1998).

Growth and development of Wikipedia
Previous work on studying the expansion of Wikipedia through the addition of new pages usually studies the phenomenon from the perspective of Wikipedia itself, e.g., by analyzing how newly created articles fit in Wikipedia's semantic network, studying the relation between activity on talk pages and the addition of new content to articles, or by studying controversy and disagreement on new content through "edit wars" (Kämpf, Tessenow, Kenett, & Kantelhardt, 2016; Keegan, Gergle, & Contractor, 2013; Yasseri, Sumi, Rung, Kornai, & Kertész, 2012).
Emerging entities have become an object of study in the natural language processing and information retrieval communities. Different methods for identifying and linking unknown or emerging entities have been proposed (Hoffart, Altun, & Weikum, 2014; Lin, Mausam, & Etzioni, 2012; Nakashole, Tylenda, & Weikum, 2013; Voskarides, Odijk, Tsagkias, Weerkamp, & de Rijke, 2014). Graus, Tsagkias, Buitinck, and de Rijke (2014) study the problem of predicting new concepts in social streams. Färber, Rettinger, and El Asmar (2016) study the specific challenges and aspects that come with linking emerging entities, while Reinanda, Meij, and de Rijke (2016) study the problem of identifying relevant documents for known and emerging entities as new information comes in, and Graus, Tsagkias, Weerkamp, Meij, and de Rijke (2016) present a method for updating representations based on newly identified information. Our work differs from the aforementioned studies in being observational in nature and in its focus on temporal patterns.

Research Questions
In studying emergence patterns of entities, we apply different methods of grouping entities. First, we apply a burst-based unsupervised hierarchical clustering method to group entities by similarities in their emergence patterns. This allows us to answer the following question: RQ1 Are there common patterns in how entities emerge in online text streams?
Next, we examine emerging entities in different types of text stream, viz. news and social media streams. In addition, we study the cross-pollination between the two types of streams, i.e., we study whether entities appear first in either of the streams, or whether they simultaneously appear in both. We answer the following question: RQ2 Do news and social media text streams exhibit different emergence patterns?
Finally, we characterize the emergence patterns of different types of entities. We leverage DBpedia, the structured counterpart of Wikipedia, to group emerging entities by their types, e.g., companies, athletes, and video games. We answer the following question: RQ3 Do different types of entities exhibit different emergence patterns?

Data and Methods
Our dataset spans 7.3 million time-stamped documents, with 36.2 million references to n = 79,482 unique emerging entities, i.e., entities that did not have a Wikipedia entry at the time they were first mentioned in the corpus, but that did have one by the time the last document in the corpus was published.
We create our custom dataset by extending the TREC KBA StreamCorpus 2014 with an additional set of annotations to Freebase entities (FAKBA1). We then enrich the FAKBA1 dataset with links to Wikipedia, including for each link (i) the creation date of the associated Wikipedia page, and (ii) whether the Wikipedia page existed at the time the document was created. To encourage further research in emerging entities, we publicly release the tools needed to acquire the dataset used in this paper.

OOKBAT Dataset
Our custom dataset is based on the TREC KBA StreamCorpus 2014, which comprises roughly 1.2 billion timestamped documents from global public news wires, blogs, forums, and shortened links shared on social media. It spans 572 days (October 7, 2011 to May 1, 2013).
All (English) documents in the StreamCorpus have been automatically tagged for named entities with the Serif tagger (Boschee, Weischedel, & Zamanian, 2005), yielding roughly 580M tagged documents. Dalton, Frank, Gabrilovich, Ringgaard, and Subramanya (January 2015) further automatically annotated these 580M documents with Freebase entities, resulting in the Freebase Annotations of TREC KBA 2014 Stream Corpus (FAKBA1) dataset, which spans over 394M documents (Table 1, line 2). Because the Freebase used in FAKBA1 is dated after the StreamCorpus timespan, we can identify entities that appear in documents prior to being incorporated in Wikipedia.
We take an entity's Wikipedia page creation date to be its time of transitioning from communicative to cultural CM. To extract Wikipedia page creation dates for the Freebase entities present in FAKBA1, we leverage the available Wikipedia-mappings in Freebase. We then append the Wikipedia page creation date (or entity timestamp, denoted $e_T$) to each entity in the FAKBA1 dataset. In addition, we include the entity's "age" relative to the document timestamp ($doc_T$): the period in days between $e_T$ and $doc_T$, i.e., $e_{age} = doc_T - e_T$, which is negative for documents published before the entity's Wikipedia page was created. The resulting dataset, FAKBA1, extended with the entity age and entity timestamp, is denoted Freebase Annotations of TREC KBA 2014 Stream Corpus with Timestamps (FAKBAT) (Table 1, line 3).
We retain only documents that contain entities with $e_{age} < 0$, i.e., emerging entities that are mentioned in documents dated before the entity's Wikipedia creation date. We denote the resulting subset of FAKBAT documents with emerging entities Out of Knowledge Base Annotations (with) Timestamps (OOKBAT) (Table 1, line 4).
To study an emerging entity's emergence patterns, we take two additional filtering steps. First, we prune entities with creation dates more recent than the last document in our stream, to ensure the entities emerged in the timespan of our document stream. Next, we prune all entities that are mentioned in fewer than 5 documents. This yields our final dataset, which comprises 79,482 emerging entities (Table 1, line 5).
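The filtering steps above can be illustrated with a minimal sketch. It assumes that mentions are available as (document id, document date, entity id) triples and that creation dates are held in a dictionary; these data structures and the function name `emerging_entities` are hypothetical and not part of the released tooling.

```python
from collections import defaultdict
from datetime import date


def emerging_entities(mentions, creation_dates, last_doc_date, min_docs=5):
    """Keep entities mentioned before their Wikipedia page was created.

    mentions: iterable of (doc_id, doc_date, entity_id) triples (assumed layout).
    creation_dates: dict mapping entity_id to Wikipedia page creation date.
    """
    docs_per_entity = defaultdict(set)
    for doc_id, doc_date, entity in mentions:
        created = creation_dates.get(entity)
        if created is None:
            continue  # entity cannot be mapped to a Wikipedia page
        if created > last_doc_date:
            continue  # page created after the stream ends
        if doc_date < created:
            # document predates the page: an "emerging" mention
            docs_per_entity[entity].add(doc_id)
    # prune entities mentioned in fewer than min_docs documents
    return {e: d for e, d in docs_per_entity.items() if len(d) >= min_docs}


# Toy data, for illustration only.
creation_dates = {
    "A": date(2012, 1, 10),  # emerges within the stream
    "B": date(2011, 1, 1),   # page already existed
    "C": date(2014, 1, 1),   # page created after the stream ends
}
mentions = [(i, date(2012, 1, 1 + i), "A") for i in range(5)]
mentions += [(10 + i, date(2012, 2, 1), "B") for i in range(5)]
mentions += [(20 + i, date(2012, 3, 1), "C") for i in range(5)]
kept = emerging_entities(mentions, creation_dates, last_doc_date=date(2013, 5, 1))
```

In this toy example only entity "A" is retained: "B" already had a page when it was mentioned, and "C" received its page after the last document in the stream.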

Entity types
To study entity types for RQ3, we map emerging entities to their respective classes assigned in the DBpedia ontology, e.g., the entity Barack Obama is mapped to the Person, Politician, Author, and Award Winner classes. Out of the 79,482 emerging entities in our dataset, we have 39,713 class-mappings (a coverage of 50.0%).

Entity popularity
As a proxy for an entity's popularity, we extract Wikipedia pageview statistics. We extract the total number of pageviews each entity received during 2015. We choose to use the pageview counts of a year that falls outside of the timespan of our dataset so as to minimize the effects of timeliness.
Table 1: Descriptive statistics of our dataset acquisition. Coverage over preceding dataset in brackets. Looking at the second and third row in the table, we note that roughly two-thirds of the FAKBA1 entities can be mapped to Wikipedia. However, this portion represents 98% of the mentions. The missing one-third were Freebase entities that had no links to Wikipedia, most notably, WordNet concepts and entities from the "MusicBrainz" knowledge base (i.e., artists and albums). The last two rows show that one in ten of the entities emerge during the span of the dataset; however, they constitute a mere 1% of the mentions.

Time series clustering
The core units of our analysis are so-called emergence patterns, i.e., time series that represent the number of documents that mention an entity over time. To answer our first research question, Are there common patterns in how entities emerge in online text streams? (RQ1), we apply clustering to group entities with similar emergence patterns. Clustering time series consists of three steps. First, we normalize the time series, as they might span very different periods of time. Next, we measure the similarities between time series. Third, we apply hierarchical agglomerative clustering.
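As an illustration of the emergence pattern used throughout, the following sketch builds a daily document mention time series from a list of document dates. It assumes day-level granularity and that the series runs from the first mention up to, but not including, the Wikipedia page creation date; the function name `mention_time_series` is hypothetical.

```python
from collections import Counter
from datetime import date, timedelta


def mention_time_series(doc_dates, creation_date):
    """Daily document counts from the first mention until page creation.

    Only documents dated before creation_date are counted (assumption:
    the creation day itself is excluded from the series).
    """
    counts = Counter(d for d in doc_dates if d < creation_date)
    if not counts:
        return []
    first_mention = min(counts)
    n_days = (creation_date - first_mention).days
    # Counter returns 0 for days without any mentioning document
    return [counts[first_mention + timedelta(days=i)] for i in range(n_days)]


# Toy example: three documents in the three days before page creation.
series = mention_time_series(
    [date(2012, 1, 1), date(2012, 1, 1), date(2012, 1, 3)],
    creation_date=date(2012, 1, 4),
)
```

Under these assumptions, the length of the series equals the emergence duration in days and its sum equals the emergence volume.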

Normalization
The entities' time series we study here are characterized by several properties. First, they are of variable length: some entities may take days to transition, others take months. Second, the time series in our dataset are temporally unaligned: each time series starts at the timestamp of the article that first mentions the entity, and ends at the entity's Wikipedia page creation date.
To be able to visualize the time series of variable lengths, we linearly interpolate the time series to have equal length (Rani & Sikka, 2012). Furthermore, since we are not interested in the absolute differences in document volumes or entity popularity when visualizing the time series clusters, we standardize our time series by subtracting the mean and dividing by the standard deviation (Vlachos, Meek, Vagena, & Gunopulos, 2004).
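A minimal sketch of these two normalization steps is given below, using only the Python standard library. The helper names `resample_linear` and `standardize` are hypothetical, and the use of the population standard deviation is an assumption; the text does not specify which estimator is used.

```python
from statistics import mean, pstdev


def resample_linear(series, length):
    """Linearly interpolate a series to a fixed number of points."""
    if len(series) == 1 or length == 1:
        return [float(series[0])] * length
    scale = (len(series) - 1) / (length - 1)
    out = []
    for i in range(length):
        pos = i * scale                      # position in the original series
        lo = min(int(pos), len(series) - 1)  # left neighbour
        hi = min(lo + 1, len(series) - 1)    # right neighbour
        frac = pos - lo
        out.append(series[lo] * (1 - frac) + series[hi] * frac)
    return out


def standardize(series):
    """Subtract the mean and divide by the (population) standard deviation."""
    mu = mean(series)
    sigma = pstdev(series)
    if sigma == 0:
        return [0.0 for _ in series]  # constant series: no variation to scale
    return [(x - mu) / sigma for x in series]


stretched = resample_linear([0, 10], 5)
z = standardize([1, 2, 3])
```

Here a two-point series is stretched to five points, and the standardized series has zero mean and unit standard deviation.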

Similarity
Typically, time series similarity metrics rely on fixed-length time series, and may leverage seasonal or repetitive patterns (Liao, 2005). Since our time series are of variable length, and not temporally aligned, common time series similarity metrics such as Dynamic Time Warping (DTW) are not suitable (Berndt & Clifford, 1994). Furthermore, we are interested in periods in which the exposure of an entity in public discourse increases or changes. These "bursts" may be correlated to real-world activity and events around the entity. To address the nature of the time series and our focus on bursts, we employ Burst Similarity (BSim) (Vlachos et al., 2004) as our similarity metric. BSim relies on measuring the overlap between bursts of different time series. To detect these bursts we compute a moving average of each emerging entity time series ($T_e$), denoted $T^{MA}_e$. We set parameter $w$ (the size of the rolling window) to 7 days. Bursts are the points in $T^{MA}_e$ that surpass a cutoff value $c$. We set $c = 1.5 \cdot \sigma_{MA}$, where $\sigma_{MA}$ is the standard deviation of the moving average $T^{MA}_e$. The parameter choices for $w$ and $c$ are in line with previous work (Vlachos et al., 2004). Figure 2 shows an example time series ($T_e$), with the bursts detected for the previously shown Curiosity (rover). The detected bursts correspond to the earlier mentioned launch and landing of the Mars rover.
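The burst detection step can be sketched as follows. The sketch follows the cutoff as stated in the text (1.5 times the standard deviation of the moving average) and assumes a trailing window that is shortened at the start of the series; the window alignment and the function names are assumptions. It covers burst detection only and does not implement the BSim overlap measure itself.

```python
from statistics import pstdev


def moving_average(series, w=7):
    """Trailing moving average; the window is shorter for the first w-1 points."""
    out = []
    for i in range(len(series)):
        window = series[max(0, i - w + 1): i + 1]
        out.append(sum(window) / len(window))
    return out


def detect_bursts(series, w=7, k=1.5):
    """Return (start, end) index pairs where the moving average exceeds k * std."""
    if not series:
        return []
    ma = moving_average(series, w)
    cutoff = k * pstdev(ma)
    bursts, start = [], None
    for i, value in enumerate(ma):
        if value > cutoff and start is None:
            start = i                      # a burst begins
        elif value <= cutoff and start is not None:
            bursts.append((start, i - 1))  # the burst ended on the previous day
            start = None
    if start is not None:
        bursts.append((start, len(ma) - 1))
    return bursts


# Toy series: silence, three days of heavy coverage, silence again.
toy = [0] * 20 + [50] * 3 + [0] * 20
bursts = detect_bursts(toy)
```

For the toy series, a single burst is detected. Because the moving average is trailing, the detected interval lags slightly behind the raw spike.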

Hierarchical agglomerative clustering
To cluster time series, we compute pair-wise similarities between all time series, yielding a similarity matrix SM. We then apply L2 normalization to SM, and convert it to a distance matrix DM (DM = 1 − SM). Finally, we apply hierarchical agglomerative clustering (HAC) on DM using the fastcluster package (Müllner, 2013). As our linkage criterion, we employ Ward's method (Ward Jr., 1963).
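The clustering pipeline can be sketched as below. This is an illustration under several assumptions: SciPy's `linkage` is used as a stand-in for the fastcluster package named in the text, the L2 normalization is assumed to be applied row-wise, and the resulting distance matrix is symmetrized with a zero diagonal so that it can be passed in condensed form. The function name `cluster_from_similarity` is hypothetical.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform


def cluster_from_similarity(sm, n_clusters=2):
    """Ward-linkage HAC on a distance matrix derived from a similarity matrix."""
    sm = np.asarray(sm, dtype=float)
    # Row-wise L2 normalization (assumption: the text does not name the axis).
    norms = np.linalg.norm(sm, axis=1, keepdims=True)
    norms[norms == 0] = 1.0
    dm = 1.0 - sm / norms              # DM = 1 - SM
    dm = (dm + dm.T) / 2.0             # enforce symmetry (assumption)
    np.fill_diagonal(dm, 0.0)          # self-distance is zero
    condensed = squareform(dm, checks=False)
    tree = linkage(condensed, method="ward")
    # Cut the tree into the requested number of flat clusters.
    return fcluster(tree, t=n_clusters, criterion="maxclust")


# Toy similarity matrix with two obvious groups: {0, 1} and {2, 3}.
toy_sm = [
    [1.0, 0.9, 0.1, 0.1],
    [0.9, 1.0, 0.1, 0.1],
    [0.1, 0.1, 1.0, 0.9],
    [0.1, 0.1, 0.9, 1.0],
]
labels = cluster_from_similarity(toy_sm, n_clusters=2)
```

Cutting the tree at two clusters mirrors the Level 1 split analyzed in the results; cutting at a larger number of clusters would expose the sub-clusters deeper in the tree.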

Analysis
Given our grouping methods (clustering, by entity type, by stream type), we apply two methods to analyze the resulting groups of emerging entities: (i) visualization of group signatures, and (ii) descriptive statistics that reflect properties of the underlying time series.

Visualization
To compare groups of emerging entities, we visualize their so-called group signatures, i.e., the average of all time series that belong to a group. See Figure 4 for an example group signature of all emerging entities in our dataset (n = 79,482). As described above, the time series (may) differ in length, and are not temporally aligned. To visualize the time series, we linearly interpolate each to the (overall) longest emergence duration, effectively "stretching" them to have equal length. Next, we align them in relative duration, i.e., we overlay each entity's first and last mention at the start and end of the x-axis, respectively.
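A group signature as described above can be sketched with NumPy. The sketch stretches every series to the longest length in the group and averages them point-wise; the standardization step described under Normalization is left out for brevity, and the function names are hypothetical.

```python
import numpy as np


def stretch(series, length):
    """Linearly interpolate a series onto `length` points over relative time."""
    series = np.asarray(series, dtype=float)
    if len(series) == 1:
        return np.full(length, series[0])
    src = np.linspace(0.0, 1.0, num=len(series))  # original relative positions
    dst = np.linspace(0.0, 1.0, num=length)       # target relative positions
    return np.interp(dst, src, series)


def group_signature(group):
    """Point-wise average of all series, aligned on first and last mention."""
    longest = max(len(s) for s in group)
    return np.mean([stretch(s, longest) for s in group], axis=0)


# Toy group: a rising series and a shorter falling series.
signature = group_signature([[0, 2, 4], [4, 0]])
```

In the toy example, the falling two-point series is stretched to three points before averaging, so both series share the same relative time axis.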

Descriptive statistics
Visualizing group signatures does not paint the full picture. More fine-grained aspects of emergence, e.g., the average emergence duration (the time between an entity's first mention in the text stream, and its subsequent incorporation into Wikipedia), or emergence volume (the number of documents that mention the entity before it is incorporated into Wikipedia) disappear through our visualization method. To study these aspects, we describe the time series groups using different features that reflect the emergence behavior of the group. For an overview of the descriptive statistics that we consider, see Table 2.
Figure 3 (caption): For clarity, the tree is truncated by showing no more than 7 levels of the hierarchy.
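The features named above can be computed per entity and then summarized per group, as in the following sketch. It assumes that velocity is the emergence volume divided by the emergence duration (documents per day), and that groups are summarized by mean, standard deviation, and median; the function names and the toy values are illustrative only.

```python
from datetime import date
from statistics import mean, median, pstdev


def emergence_features(first_mention, created, n_docs):
    """Duration (days), volume (docs), and velocity (docs/day) for one entity."""
    duration = (created - first_mention).days
    # Guard against a zero-day duration (assumption: count it as one day).
    velocity = n_docs / max(duration, 1)
    return {"duration": duration, "volume": n_docs, "velocity": velocity}


def summarize(values):
    """Group-level summary of one feature."""
    return {"mean": mean(values), "std": pstdev(values), "median": median(values)}


# Toy entity: first mentioned on Jan 1, page created on Jan 11, 50 documents.
features = emergence_features(date(2012, 1, 1), date(2012, 1, 11), n_docs=50)
summary = summarize([1, 2, 3, 10])
```

The gap between mean and median in the toy summary illustrates why both are worth reporting for skewed features such as emergence volume.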

Results
In this section we present the analyses that answer our research questions. Figure 3 shows a cluster tree that results from clustering the time series distance matrix. At its highest level, the tree shows two distinct clusters, each of which is broken down into multiple smaller sub-clusters. In the following section, we study the global emergence patterns, by taking all time series at the root of the tree (Top level in Figure 3), and next, the two main clusters (Level 1 in Figure 3).

RQ1: Emergence patterns
Global emergence pattern
Figure 4 shows how both the emerging entities' introduction into public discourse (the first mention at the left-most side of the plot) and subsequent incorporation into cultural CM (the rightmost side of the plot) occur in bursts of documents, i.e., overall, the largest number of documents that mention a newly emerging entity are either at the start or at the end of their time series. This can be explained as follows. The entrance into public discourse represents the first emergence of an entity, whereas being added to cultural CM is likewise likely to happen in a period of increased attention, e.g., a real-world event that puts the entity in public discourse. Between these two bursts, the number of documents that mention the entity seems to increase gradually as time progresses, suggesting that on average, the number of documents that mention a new entity, and thus the attention the entity receives in public discourse, increases over time before it reaches "critical mass." Turning to the descriptive statistics in Table 3, it takes 245 days on average for an entity to emerge, but with large variations between entities, motivating our clustering approach. On average, an entity is associated with multiple bursts (3.8), indicating that entities are likely to resurface multiple times in public discourse before being deemed important enough to transition into cultural CM.

Clusters at level 1 in Figure 3: Early vs. late bursts
In our first attempt at uncovering distinct patterns in which collective remembrance happens, we study the two main clusters at Level 1 of the cluster tree (Figure 3). The resulting cluster signatures are shown in Figure 5. Much like the global cluster signatures in the previous section, the Level 1 clusters show two main bursts: the initial burst around the first mention, and the final burst around the time an entity is added to Wikipedia. However, the left cluster, which we call early bursting (EB) entities, is characterized by a stronger initial burst, with the majority of the documents that mention the entity concentrated at the time when the entity surfaces in communicative CM. This suggests that the cluster contains new entities that suddenly emerge and experience a (brief) period of lessened attention, before transitioning into the collective's CM. The right cluster, which we denote as late bursting (LB) entities, shows a more gradual pattern in activity towards the point at which the entity is incorporated into cultural CM, much like we saw in the global signature.
We note two main differences between the group signatures of the EB and LB entities in Figure 5. The first is the distribution of documents between the initial and final burst. The EB entities show a more "abrupt" final burst: the majority of the documents are in the wake of the initial burst, i.e., at the left-hand side of the plot; the document volume then gradually winds down, before it abruptly transitions into the final burst at the right-hand side of the plot. In contrast, the LB entities cluster shows a relatively subtle initial burst, which likewise quiets down, followed by a gradual increase of document volume that leads up to the final burst.
A second difference is the height difference between the initial and final bursts. The EB cluster shows roughly equally high initial and final bursts; the LB cluster shows a substantially smaller initial burst, which suggests the introduction into public discourse is more subtle than its addition to Wikipedia.
We turn to the clusters' descriptive statistics in Table 4. We first test for statistical significance in the differences between the cluster statistics. We perform a Kruskal-Wallis one-way analysis of variance test, and follow this omnibus test with a post-hoc test using Dunn's multiple comparison test (with p-values corrected for family-wise errors using Holm-Bonferroni correction). We find that all differences are statistically significant at the α = 0.05 level. Table 4 shows LB entities emerge more slowly (259 days) than EB entities (224 days). LB entities also receive more exposure during emergence (225 vs. 118 documents for EB entities). The shorter emergence duration and lower volumes seen with the EB entities suggest they represent more popular, timely, or "urgent" entities that will be incorporated quickly after emerging in public discourse, e.g., large-scale events and popular entities. The descriptive statistics of the LB entities, on the other hand, suggest less timely or urgent entities. The burst statistics confirm this view of slower, less timely LB entities, and more urgent, faster EB entities: the average burst heights of EB entities are higher, suggesting LB entities see a more evenly spread volume of documents that mention them. Furthermore, EB entities show fewer bursts (3.22 vs. 4.12, on average).
And indeed, the EB entities that occur most frequently in our dataset include many "central" entities related to popular culture, e.g., products such as Xbox One (121,813 mentions), movies, e.g., The Twilight Saga: Breaking Dawn – Part 2 (124,222 mentions), and news events, e.g., Disappearance of Lisa Irwin (15,917 mentions). The most frequent LB entities, on the other hand, include more obscure, long-tail, or niche entities: most notably people, e.g., Jeffrey Chiesa (31,560 mentions), Sergio Ermotti (22,274 mentions), and James Rolfe (filmmaker) (15,797 mentions).
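The Holm-Bonferroni correction mentioned in the significance testing above can be sketched in a few lines. The sketch covers only the p-value adjustment; it does not reproduce the Kruskal-Wallis or Dunn tests, and the function name `holm_adjust` and the example p-values are illustrative, not results from our data.

```python
def holm_adjust(p_values):
    """Holm-Bonferroni step-down adjustment of a list of p-values.

    The i-th smallest p-value (i = 1..m) is multiplied by (m - i + 1);
    adjusted values are capped at 1 and forced to be non-decreasing.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        candidate = min(1.0, (m - rank) * p_values[idx])
        running_max = max(running_max, candidate)  # keep monotonicity
        adjusted[idx] = running_max
    return adjusted


# Illustrative p-values from three hypothetical pairwise comparisons.
adjusted = holm_adjust([0.01, 0.04, 0.03])
significant = [p < 0.05 for p in adjusted]
```

A comparison is then called significant when its adjusted p-value falls below the chosen α.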

Summary
In this section, we have answered our first research question: "Are there common patterns in how entities emerge in online text streams?" We performed hierarchical clustering using a burst similarity metric of the emerging entity time series and discovered two distinct emergence patterns: early bursting entities and late bursting entities. Our visual inspection of the cluster signatures suggests LB entities emerge more slowly, i.e., build up attention more slowly before transitioning from communicative into cultural CM, whereas EB entities are associated with more sudden and higher bursts of activity, prior to transitioning into cultural CM. We find that the two clusters differ substantially and significantly in their cluster signature and descriptive statistics.

RQ2: Emerging entities in social media and news
In this section we answer our second research question: "Do news and social media text streams exhibit different emergence patterns?" In the previous section we have shown that 79,482 entities emerge in the combined news and social media streams. By splitting out these entities by stream, we find 51,095 of these entities emerge in the news stream (i.e., are mentioned in the news stream), similar to the number of entities that emerge in the social media stream, at 51,356. Finally, 30,148 of the emerging entities are mentioned in both streams before being incorporated into Wikipedia.

Global: News vs. social
First, we compare the emergence patterns of entities in news and social streams. We apply the same hierarchical clustering method from the previous section on the two subsets of entities that emerge in news and social media streams (where $n_{news}$ = 51,095 and $n_{social}$ = 51,356). Figure 6 first shows the global emergence patterns (top row, in green), which are largely the same in the two streams and highly similar to the global patterns studied in the previous section. The bottom two rows of Figure 6 show that both streams exhibit groups that are similar to the early bursting and late bursting entities shown in Figure 5. Looking at the top row, however, shows that entities that emerge in news have slightly more of their emergence volume mass after the initial burst, compared to the corresponding pattern of the social media stream, which exhibits a more gradual increase in emergence volume towards the final burst at the right-hand side of the plot. This may be attributed to the slightly higher proportion of early bursting entities in the news stream, which has 50.0% of its entities falling in this cluster, while the social media stream has 48.6%.

Who's first?
Of the 79,482 entities that emerge in the 18-month period our dataset spans, 45,678 appear in both the news and social media stream before they transition to cultural CM; 9,681 entities are mentioned exclusively in the news stream, and never appear in social media (news-only) before transitioning into cultural CM. Finally, 23,096 appear only in the social media stream (social-only). In Table 5, we compare entities as they emerge in different streams.
Of the 45,678 entities that emerge in both streams, the majority appear in the social media stream before they appear in the news stream. This may be explained by the different nature of the publishing cycles of the two streams: whereas news stories need to be checked and edited before being published, social media follows a more unedited and direct publishing cycle.
The entities that appear in social media first (social-first) cover 64.9% (n = 29,665) of the entities that emerge in both streams. Interestingly, entities that emerge in news first subsequently appear in social media streams more slowly than vice versa: on average 66 days for the former, and 49 days for the latter. A relatively small number of entities is mentioned in both streams on the same day (same-time): 8.7% (n = 3,967). Such entities are expected to be more urgent and central, as they appear more widely in public discourse. This group's shortest emergence durations and highest velocities support this view of entities that play a more central role in public discourse. And indeed, looking at the entities that appear in this set, we see a large number of news events-related entities, e.g., 12-12-12: The Concert for Sandy Relief, 2013 Alabama bunker hostage crisis, and Suicide of Jacintha Saldanha.
Table 5: Emergence features for our five groups of entities: entities that emerge in both streams, but first in the news stream (news-first), entities that emerge in both streams, but first in the social media stream (social-first), entities that emerge in both streams on the same day (same-time), entities that emerge only in the news stream (news-only), and finally, entities that emerge only in the social media stream (social-only). Columns: stream; duration (#days), volume (#docs), and velocity (docs/day), each reported as mean ± std and median.
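The five groups compared in Table 5 can be derived from the date of an entity's first mention in each stream, as in the following sketch. It assumes day-level granularity and uses `None` for a stream in which the entity is never mentioned before its Wikipedia page is created; the function names are hypothetical.

```python
from datetime import date


def emergence_group(first_news, first_social):
    """Assign an entity to one of the five stream-order groups."""
    if first_news is None and first_social is None:
        raise ValueError("entity is not mentioned in either stream")
    if first_social is None:
        return "news-only"
    if first_news is None:
        return "social-only"
    if first_news < first_social:
        return "news-first"
    if first_social < first_news:
        return "social-first"
    return "same-time"


def cross_stream_lag(first_news, first_social):
    """Days between the first mention in one stream and in the other."""
    return abs((first_news - first_social).days)


group = emergence_group(date(2012, 3, 1), date(2012, 1, 12))
lag = cross_stream_lag(date(2012, 3, 1), date(2012, 1, 12))
```

In the toy example the entity appears in social media 49 days before it appears in news, so it falls in the social-first group.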

Summary
News and social media streams show broadly similar emergence patterns for entities, but the population and the behavior of entities emerging in news and social media differ significantly. Entities are slower on average in emerging in social media streams, and entities that appear in both streams on the same day are the fastest to transition to cultural CM.

RQ3: Emergence patterns of different entity types
In this section we answer our third research question: "Do different types of entities exhibit different emergence patterns?" We compare the descriptive statistics of different entity types in our dataset to assess whether different types exhibit different emergence patterns.

Entity types: temporal patterns
First, we study the descriptive statistics per entity type. Table 6 provides an overview of the most frequent entity types in our dataset (i.e., all entity types with a frequency of ≥ 400). We find that the entity type signatures are very similar to the global pattern of Figure 4, which suggests the time series are highly variable within an entity type. See Figure 7 for an example of two common entity types (top row) and two less frequently emerging types (bottom row): whereas the signature becomes smoother as the number of mentions increases, the overall pattern is highly similar across the four types. Turning to Table 6, we note that the null class, i.e., entities that are not assigned an entity type in DBpedia, exhibits very low emergence volumes (98 documents on average). This may be explained by their nature: long-tail or unpopular entities are more likely to lack a class in the DBpedia ontology.
Second, we note a group of "fast" emerging entity types with short emergence durations and/or high emergence velocities, e.g., DesignedArtifact, CreativeWork, MusicalWork, and VideoGame. Consider, e.g., the DesignedArtifact type, which emerges in 217 days with over 7 documents a day on average. This type includes entities such as devices and products, e.g., smartphones, tablets, and laptops. The relatively fast transitioning speed may be explained by their nature: they have short "life-cycles" and may be superseded or replaced at high frequencies. Consider, e.g., the release or announcement of a new smartphone: this event typically generates a lot of attention in a short timeframe, which may result in a fast transition into cultural CM. Similar to the DesignedArtifact type, CreativeWorks (including, e.g., MusicalWork, WrittenWork, Movie) share this characteristic: they play a central but short-lived role in public discourse.
Third, the "slower" entities, i.e., those with longer emergence durations and lower emergence volumes, are largely person types such as Writer, Artist, and political figures (OfficeHolder), but also School and EducationalInstitution, and geographical entities (e.g., Building, ArchitecturalStructure, Place, and PopulatedPlace). These entities may have longer life-cycles and a more gradual "rise to fame" by their nature, and have a less central role in public discourse. Consider, e.g., politicians, who generally have a long and gradual career and are more likely to first emerge in local media. Similarly, the opening of a new school building may emerge in regional news, but is unlikely to be globally and widely reported. To better understand the difference between "fast" and "slow" entities, we examine the popularity of entities. Table 7 lists the average number of pageviews received per entity in 2015, per type. Looking at the ranking, we note how "faster" emerging entity types remain more popular over time: types that are associated with short emergence durations and high velocities all fall in the top 10 (ranks 3, 4, and 9, for VideoGame, CreativeWork, and DesignedArtifact, respectively), whereas slower types reside in lower ranks in the table, e.g., ranks 19, 22, and 24 for Building, EducationalInstitution, and School, respectively.

Summary
We have shown that different entity types exhibit substantially different emergence patterns, but entities that belong to a particular type show broadly similar emergence patterns. Furthermore, entities with a fast transition from communicative to cultural CM are more likely to remain popular over time.

Conclusion
In this paper we studied entities as they transition from communicative into cultural collective memory. We did so by studying a large set of time series of mentions of entities in online news and social media streams before they transition into cultural CM (as represented by the creation of a Wikipedia page). We studied implicit groups of similarly emerging entities by applying a burst-based agglomerative hierarchical clustering method, and explicit groups by isolating entities by whether they emerge in news or social media streams.

Findings
We found that, globally, entities have a long time span between surfacing in communicative CM and transitioning into cultural CM. During this time span, an entity may emerge with multiple bursts; however, both the entity's introduction into public discourse and its subsequent transition into cultural CM occur in the largest document bursts. Emergence statistics show large standard deviations, indicating that they differ substantially between entities. For this reason, we turned to time series clustering to uncover distinct groups of entities. We discovered two emergence patterns: early bursting (EB) entities and late bursting (LB) entities. Analysis suggests that EB entities comprise mostly "head" or popular entities; they exhibit fewer and higher bursts, with shorter emergence durations and lower emergence volumes. The LB entities emerge more slowly, and witness a more gradual increase of exposure before transitioning into cultural CM. The emergence patterns we visualized differ substantially from the global average and from, e.g., the type signatures shown in Figure 7, suggesting that the entities in each of the underlying clusters exhibit substantially different and distinct emergence patterns from entities in the other clusters.
We showed that entities emerging in news and social media streams display very similar emergence patterns, but that, on average, entities that emerge in social media take longer to be incorporated into cultural CM. We hypothesize that this can be attributed to the nature of the underlying sources. News media are more mainstream and professional, with a larger audience, reach, and authority, than social media. Our findings are in line with those of Petrovic, Osborne, Mccreadie, Macdonald, and Ounis (2013), who compare breaking news on traditional media with that on social media. Their findings suggest that reported events largely overlap between both media; however, social media exhibits a long tail of minor events, which may explain the longer uptake on average. Leskovec, Backstrom, and Kleinberg (2009) find that the "attention span" for news events on social media both increases and decays at a slower rate than for traditional news sources, which additionally supports our observations of the slower uptake on social media.
Finally, we showed that different entity types exhibit substantially different patterns, but entities of similar types show similar patterns. Some entity types, e.g., devices or creative works, transition faster from communicative to cultural CM than entities such as buildings, locations, and people. At the same time, the former "fast" entity types remain more popular over time.
One aspect that distinguishes "fast" from "slow" entity types is that the former are more likely to appear in so-called "soft news" that covers sensational or human-interest events and topics (e.g., news related to celebrities and cultural artifacts). The slower entity types, on the other hand, are more likely to appear in more substantive "hard news" that encompasses more urgent events and topics (e.g., political elections) (Tuchman, 1972). Granka (2010) studied the differences in "attention span" of the public and the traditional news media for "hard" and "soft news," and found that hard news is associated with a relatively shorter period of public attention. Soft news exhibits a slower decrease of the public's attention, which supports our finding that faster entity types (more likely associated with soft news) tend to remain more popular over time.
Emerging entities are not "born equal." The patterns and circumstances under which an entity transitions from communicative to cultural CM differ depending on source and type.

Implications
Our findings have implications for designing systems to detect emerging entities, and more generally for studying and understanding how collective memories are formed. We show that entities are likely to resurface multiple times in public discourse before transitioning into cultural CM. This suggests that monitoring bursts of new entities could prove effective for predicting the formation of collective memories. Furthermore, we show that entities exhibit different emergence patterns depending on the type of stream in which they emerge. This suggests that taking the different nature of streams into account can be beneficial for predicting emerging entities. Finally, we show that different types of entities exhibit different emergence patterns, suggesting the underlying entity type could likewise prove valuable in predicting emerging entities.
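To make the idea of monitoring bursts concrete, the following Python sketch flags bursts in an entity's daily document counts. It assumes a simple threshold rule: a day is bursty when its count exceeds the series mean by k standard deviations, and consecutive bursty days form one burst. This rule, the function name, and the example counts are illustrative assumptions, not the burst detection method used in this paper.

```python
from statistics import mean, pstdev
from typing import List, Sequence, Tuple


def detect_bursts(counts: Sequence[int], k: float = 2.0) -> List[Tuple[int, int]]:
    """Return bursts as (first_day, last_day) index pairs, both inclusive.

    Assumption: a day belongs to a burst when its document count is
    strictly greater than mean + k * (population) standard deviation
    of the whole series.
    """
    if not counts:
        return []
    threshold = mean(counts) + k * pstdev(counts)

    bursts = []
    start = None  # index of the first day of the burst in progress
    for day, count in enumerate(counts):
        if count > threshold:
            if start is None:
                start = day
        elif start is not None:
            bursts.append((start, day - 1))
            start = None
    if start is not None:
        # The series ends while a burst is still in progress.
        bursts.append((start, len(counts) - 1))
    return bursts


# Hypothetical daily counts: an early burst, inactivity, then a late burst.
daily_counts = [0, 0, 9, 8, 0, 0, 0, 0, 0, 7]
print(detect_bursts(daily_counts, k=1.0))
```

Because the threshold is computed over the full series, this sketch applies only retrospectively; a monitoring system would need a threshold estimated from past observations only.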

Limitations
Some of our findings are derived from an unsupervised clustering method. Interpreting cluster signatures is a subjective matter, and clustering is a difficult task to evaluate (Von Luxburg, Williamson, & Guyon, 2012). In our defense, the clustering's dendrogram suggests the presence of distinct and meaningful groups, as the structure of the dendrogram shows symmetry and clear separations. More importantly, the cluster signatures yielded visually discernible and distinct patterns between clusters, which was not the case for the signatures of other grouping strategies in this paper (see, e.g., Figure 7).
The fragmented nature of the source of our dataset (TREC-KBA StreamCorpus 2014) means that coverage, and hence representativeness of the data, cannot be guaranteed. Popular social media channels such as Tumblr, Twitter, and Facebook are not part of the dataset, so there may be a sampling bias in the sources that represent the streams, resulting in a similar bias in the entities. Different sources may well yield different findings. This is unavoidable.
Another limitation relates to the entity annotations used as a starting point in this paper: they cannot be assumed to be 100% accurate. So-called "cascading errors" (Finkel, Manning, & Ng, 2006) in NLP pipelines cause the accuracy of downstream tasks to suffer; in our case, imperfect tags (named entities) feed into imperfect tagging (entity linking). The FAKBA1 annotations are estimated (from manual inspection) to contain around 9% incorrectly linked entities, with around 8% of SERIF mentions being wrongfully not linked. Moreover, the "difficult" entities are long-tail entities, which are more likely to be part of our filtered set. However, manually correcting the annotations was beyond the scope of this study.
Finally, there may be a cultural bias inherent in our choice of datasets: we used English language news sources and social media as well as the English version of Wikipedia. One could object that we studied the birth of collective memories for the English-speaking part of the world, and that different datasets may also yield different findings. It is unfortunate that the English-speaking part of the world is disproportionately represented in our field of research, as is witnessed by the biggest constraint in conducting this study: dataset availability. We invite the community to create suitable datasets in other languages or reflecting cultural practices in other parts of the planet so as to enable comparative studies.

Future work
As a next step, we should take a closer look at the circumstances in which entities emerge, by not exclusively considering in how many documents they appear over time, but also in which contexts, e.g., by looking at the content of the articles themselves. Furthermore, in this paper we have chosen to restrict ourselves to the entities that transition and remain in cultural CM.
Another interesting aspect of CM is the notion of "consensus." For example, one could study the emergence patterns of entities that are removed from cultural CM after transitioning. Finally, the observations made in this paper could be explored in a prediction task, where, e.g., given a partial entity time series, the task would be to predict the point at which the entity transitions from communicative to cultural CM.