Event Triples from Wikipedia

This is a collection of event triples extracted from the English Wikipedia dump of May 2015. Entities are of type actor, location, or date. Actors and locations were annotated by matching Wikipedia links to Wikidata items. Dates were annotated using HeidelTime. Triples were constructed from co-occurrences of entities within a window of at most 3 consecutive sentences. An example of an event triple is shown in the following image:


Information about the extracted event triples is stored in the following format:

field         description                                  value
e_doc_id      Wikipedia page ID                            25006149
e_geo_text    text of extracted geo location (green box)   White House
e_geo_id      Wikidata ID of extracted geo location        Q35525
e_geo_sent    sentence number on Wikipedia page            4
e_actor_text  text of extracted actor (red box)            Barack Obama
e_actor_id    Wikidata ID of extracted actor               Q76
e_actor_sent  sentence number on Wikipedia page            4
e_time_text   text of extracted time (blue box)            July 11, 2013
e_time_norm   normalized time                              2013-07-11
e_time_sent   sentence number on Wikipedia page            4
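
The construction rule and the record layout above can be sketched together in a few lines of Python. This is only an illustration, not the extraction pipeline itself: the `Mention` type and the `build_triples` function are hypothetical names, and the sketch assumes that "a window of at most 3 consecutive sentences" means the three mentions span no more than 3 sentences in total. The last mention in the sample is made up to show a combination that is rejected.

```python
from itertools import product
from typing import NamedTuple


class Mention(NamedTuple):
    """One annotated entity mention (hypothetical helper type)."""
    kind: str   # "geo", "actor", or "time"
    text: str   # surface text as it appears on the page
    ref: str    # Wikidata ID for geo/actor, normalized date for time
    sent: int   # sentence number on the Wikipedia page


def build_triples(doc_id, mentions, window=3):
    """Combine geo, actor, and time mentions that co-occur within `window` sentences."""
    by_kind = {"geo": [], "actor": [], "time": []}
    for m in mentions:
        by_kind[m.kind].append(m)

    triples = []
    for geo, actor, time in product(by_kind["geo"], by_kind["actor"], by_kind["time"]):
        sents = (geo.sent, actor.sent, time.sent)
        # Assumption: the mentions may span at most `window` sentences in total.
        if max(sents) - min(sents) < window:
            triples.append({
                "e_doc_id": doc_id,
                "e_geo_text": geo.text, "e_geo_id": geo.ref, "e_geo_sent": geo.sent,
                "e_actor_text": actor.text, "e_actor_id": actor.ref, "e_actor_sent": actor.sent,
                "e_time_text": time.text, "e_time_norm": time.ref, "e_time_sent": time.sent,
            })
    return triples


mentions = [
    Mention("geo", "White House", "Q35525", 4),
    Mention("actor", "Barack Obama", "Q76", 4),
    Mention("time", "July 11, 2013", "2013-07-11", 4),
    # Made-up mention far away on the page; it falls outside the window.
    Mention("time", "2009", "2009", 12),
]
triples = build_triples(25006149, mentions)
```

With the sample mentions, only the combination taken entirely from sentence 4 is kept, and it reproduces the example record shown in the table.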

The following heatmap illustrates where the majority of the Wikipedia events are located. GPS coordinates were obtained from Wikidata by looking up the item referenced in e_geo_id.
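
How the coordinates were retrieved is not described here. One plausible route, shown as a sketch, is to read the "coordinate location" property (P625) from the Wikidata entity JSON of the item named in e_geo_id. The function name `coordinates_from_entity` is hypothetical, and the sample entity below is a heavily trimmed, hand-written stand-in with approximate coordinates rather than a real API response.

```python
def coordinates_from_entity(entity):
    """Return (latitude, longitude) from a Wikidata entity dict, or None if absent."""
    claims = entity.get("claims", {}).get("P625", [])
    for claim in claims:
        snak = claim.get("mainsnak", {})
        # Claims with snaktype "novalue" or "somevalue" carry no datavalue.
        if snak.get("snaktype") != "value":
            continue
        value = snak["datavalue"]["value"]
        return value["latitude"], value["longitude"]
    return None


# Trimmed, hand-written stand-in for the entity JSON of Q35525 (approximate values).
sample_entity = {
    "id": "Q35525",
    "claims": {
        "P625": [{
            "mainsnak": {
                "snaktype": "value",
                "property": "P625",
                "datavalue": {
                    "type": "globecoordinate",
                    "value": {"latitude": 38.8977, "longitude": -77.0365},
                },
            },
        }],
    },
}

location = coordinates_from_entity(sample_entity)
```

Items without a P625 claim yield `None`, so events at such locations would simply not appear on a map built this way.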


For the normalized time field, e_time_norm, the year distribution is illustrated below. Most of the events fall within the period between 1650 and 2015, although a few events date from before 1650.
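
A year distribution of this kind can be computed from e_time_norm with a simple counter. The sketch below assumes that usable values begin with a four-digit year, as in 2013-07-11, and skips everything else. The helper names `year_of` and `year_histogram` are hypothetical, and all sample values except 2013-07-11 are made up.

```python
import re
from collections import Counter

# Assumption: a usable value starts with a four-digit year, optionally followed by "-...".
YEAR_PREFIX = re.compile(r"^(\d{4})(?:-|$)")


def year_of(time_norm):
    """Extract the year from a normalized time value, or None if it has no plain year."""
    match = YEAR_PREFIX.match(time_norm)
    return int(match.group(1)) if match else None


def year_histogram(time_norms):
    """Count how many values fall into each year, ignoring values without a year."""
    years = (year_of(value) for value in time_norms)
    return Counter(year for year in years if year is not None)


sample = ["2013-07-11", "2013-07", "1650", "XXXX-12-25"]
histogram = year_histogram(sample)

# Number of sample values whose year lies between 1650 and 2015 (inclusive).
in_range = sum(n for year, n in histogram.items() if 1650 <= year <= 2015)
```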


Because the complete event triples dataset is large, we also provide the triples split into five separate files. The partitioning does not follow any particular ordering or organization:


(463 MB compressed, 10.7 GB uncompressed, 153,091,882 event instances)

Individual files:

(126 MB compressed, 1.5 GB uncompressed, 20,583,186 event instances)

(126 MB compressed, 1.9 GB uncompressed, 25,676,744 event instances)

(168 MB compressed, 2.3 GB uncompressed, 33,217,732 event instances)

(282 MB compressed, 3.9 GB uncompressed, 56,461,433 event instances)

(92 MB compressed, 1.2 GB uncompressed, 17,152,787 event instances)
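
The instance counts of the five individual files add up to the count given for the complete dataset, which makes for a simple consistency check after downloading. A minimal sketch, using only the numbers listed above (the variable names are illustrative):

```python
# Event instance counts of the five individual files, as listed above.
part_counts = [20_583_186, 25_676_744, 33_217_732, 56_461_433, 17_152_787]

# Event instance count given for the complete dataset.
total_count = 153_091_882

parts_sum = sum(part_counts)
print(parts_sum == total_count)
```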

Further details on the event triples can be found in the research paper listed below. If you use the data for research purposes, please cite this paper:

Refining Imprecise Spatio-temporal Events: A Network-based Approach. A. Spitz, J. Geiß, M. Gertz, S. Hagedorn and K. Sattler. 10th ACM SIGSPATIAL Workshop on Geographic Information Retrieval (GIR), 2016.