This is a collection of event triples extracted from the English Wikipedia dump of May 2015. Each entity is of type actor, location, or date. Actors and locations were annotated by matching Wikipedia links to Wikidata items. Dates were annotated using HeidelTime. Triples were constructed from co-occurrences of entities within a window of at most 3 consecutive sentences. An example of an event triple is shown in the following image:
Information about the extracted event triples is stored in the following format:
|Field||Description||Example|
|e_doc_id||Wikipedia page ID||25006149|
|e_geo_text||text of extracted geo location (green box)||White House|
|e_geo_id||Wikidata ID of extracted geo location||Q35525|
|e_geo_sent||sentence number on Wikipedia page||4|
|e_actor_text||text of extracted actor (red box)||Barack Obama|
|e_actor_id||Wikidata ID of extracted actor||Q76|
|e_actor_sent||sentence number on Wikipedia page||4|
|e_time_text||text of extracted time (blue box)||July 11, 2013|
|e_time_sent||sentence number on Wikipedia page||4|
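The fields above map naturally onto a small record type. The following is a minimal Python sketch, not the authors' extraction code: the names `EventTriple` and `within_window` are hypothetical, and reading "a window of at most 3 consecutive sentences" as a sentence-number spread of less than 3 is an assumption made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EventTriple:
    """One event triple, using the field names from the table above."""
    e_doc_id: int
    e_geo_text: str
    e_geo_id: str
    e_geo_sent: int
    e_actor_text: str
    e_actor_id: str
    e_actor_sent: int
    e_time_text: str
    e_time_sent: int


def within_window(triple: EventTriple, window: int = 3) -> bool:
    """Assumed check: all three entities fall within `window` consecutive sentences."""
    sents = (triple.e_geo_sent, triple.e_actor_sent, triple.e_time_sent)
    return max(sents) - min(sents) < window


# The example row from the table above.
example = EventTriple(
    e_doc_id=25006149,
    e_geo_text="White House",
    e_geo_id="Q35525",
    e_geo_sent=4,
    e_actor_text="Barack Obama",
    e_actor_id="Q76",
    e_actor_sent=4,
    e_time_text="July 11, 2013",
    e_time_sent=4,
)
print(within_window(example))
```

In the example row all three entities occur in sentence 4, so the triple trivially satisfies the window constraint under this reading.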
The following heatmap illustrates where most of the Wikipedia events are located. GPS coordinates were obtained from Wikidata using the item ID stored in e_geo_id.
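A heatmap of this kind can be approximated by binning coordinates into a regular grid. The sketch below assumes a hypothetical lookup table `coords_by_id` from Wikidata IDs to (latitude, longitude) pairs, which would have to be built from Wikidata separately (coordinates are stored there under property P625, "coordinate location"). The coordinate value, the placeholder ID `Q0`, and the one-degree cell size are illustrative choices, not taken from the dataset.

```python
import math
from collections import Counter

# Hypothetical lookup table; the value is approximate and for illustration only.
coords_by_id = {
    "Q35525": (38.8977, -77.0365),  # White House (approximate)
}


def grid_cell(lat, lon, cell_deg=1.0):
    """Return the index of the grid cell that contains the coordinate."""
    return (math.floor(lat / cell_deg), math.floor(lon / cell_deg))


def heatmap_counts(geo_ids, cell_deg=1.0):
    """Count events per grid cell, skipping IDs without known coordinates."""
    counts = Counter()
    for geo_id in geo_ids:
        coord = coords_by_id.get(geo_id)
        if coord is None:
            continue  # no coordinates available for this ID
        counts[grid_cell(*coord, cell_deg=cell_deg)] += 1
    return counts


# Two events at the White House plus one placeholder ID with no coordinates.
counts = heatmap_counts(["Q35525", "Q35525", "Q0"])
print(counts)
```

The resulting per-cell counts can then be passed to any plotting library for rendering as a heatmap.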
The year distribution of the normalized time field, e_time_norm, is illustrated below. Most events fall within the period between 1650 and 2015, although a few date to before 1650.
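A year distribution like this can be computed by counting the year component of each normalized date. The following sketch assumes that e_time_norm values are ISO-8601-style strings beginning with a four-digit year (for example `2013-07-11`); the actual format is not shown in this description, and the sample values are made up for illustration.

```python
import re
from collections import Counter

# Assumption: normalized dates start with a four-digit year.
YEAR_RE = re.compile(r"^(\d{4})")


def year_of(e_time_norm):
    """Return the leading four-digit year, or None if there is none."""
    match = YEAR_RE.match(e_time_norm)
    return int(match.group(1)) if match else None


def year_distribution(values):
    """Count how many normalized dates fall into each year."""
    years = (year_of(value) for value in values)
    return Counter(year for year in years if year is not None)


# Made-up sample values, not taken from the dataset.
sample = ["2013-07-11", "2013", "1650-03", "unknown"]
dist = year_distribution(sample)

# Share of dated events that fall between 1650 and 2015.
in_range = sum(count for year, count in dist.items() if 1650 <= year <= 2015)
print(dist, in_range / sum(dist.values()))
```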
Because the complete event triples dataset is large, we also provide the triples split into five separate files. The partitioning follows no particular ordering or organization:
(463 MB compressed, 10.7 GB uncompressed, 153,091,882 event instances)
(126 MB compressed, 1.5 GB uncompressed, 20,583,186 event instances)
(126 MB compressed, 1.9 GB uncompressed, 25,676,744 event instances)
(168 MB compressed, 2.3 GB uncompressed, 33,217,732 event instances)
(282 MB compressed, 3.9 GB uncompressed, 56,461,433 event instances)
(92 MB compressed, 1.2 GB uncompressed, 17,152,787 event instances)
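The instance counts listed above suggest that the first entry is the complete dataset and the remaining five are its parts. This reading is inferred from the numbers rather than stated explicitly, and it can be checked with a short calculation:

```python
# Event instance counts as listed above.
total = 153_091_882
parts = [20_583_186, 25_676_744, 33_217_732, 56_461_433, 17_152_787]

# The five partial counts add up exactly to the first count.
parts_sum = sum(parts)
print(parts_sum == total)
```

A check like this is also a cheap way to confirm, after downloading, that no part is missing or truncated.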
Further details on the event triples can be found in the research paper listed below. If you use the data for research purposes, please cite this paper:
Refining Imprecise Spatio-temporal Events: A Network-based Approach. A. Spitz, J. Geiß, M. Gertz, S. Hagedorn and K. Sattler. 10th ACM SIGSPATIAL Workshop on Geographic Information Retrieval (GIR), 2016.