Data sets

Base datasets

These datasets contain information about named entities extracted from Wikipedia. Information is available for the classes PERSON, ORGANIZATION, and LOCATION. For each class, two types of files are available:

  1. information about the entity (NE files)
  2. links to other openly accessible databases (LINK files)

These files were extracted using the NECKAr tool.

Version 2

Wikidata dump 2017/03/20, NECKAr v1.0

complete download (484.7 MB, 3 files: all entities, all en links, all de links)

Named Entities (NE)

| Named entities | File size | Number of entities |
|----------------|-----------|--------------------|
| Persons        | 189 MB    | 3,416,903          |
| Organizations  | 30 MB     | 1,002,937          |
| Locations      | 116 MB    | 4,842,665          |
| All entities   | 336 MB    | 9,262,505          |

Links

| Links file          | EN file size | EN entries | DE file size | DE entries |
|---------------------|--------------|------------|--------------|------------|
| Persons_Links       | 53 MB        | 1,459,196  | 29 MB        | 642,373    |
| Organizations_Links | 22 MB        | 525,353    | 11 MB        | 211,148    |
| Locations_Links     | 38 MB        | 989,767    | 17 MB        | 388,634    |
| all_Links           | 107 MB       | 2,974,316  | 54 MB        | 1,242,155  |

Previous Versions

Previous versions of the base data sets are available here.

Event datasets

Event Triples from Wikipedia

This is a collection of event triples extracted from the English Wikipedia dump of May 2015. Each entity is of type actor, location, or date. Actors and locations were annotated by matching Wikipedia links to Wikidata items; dates were annotated using HeidelTime. Triples were constructed from co-occurrences of entities within a window of at most three consecutive sentences.

[Image: example of an event triple]
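The windowed co-occurrence step can be illustrated with a minimal Python sketch. It is not the extraction pipeline used for this dataset: the input format (one list of `(entity_type, entity)` annotations per sentence), the function name `extract_triples`, and the reading that a triple combines one actor, one location, and one date are all assumptions made for illustration.

```python
from itertools import product


def extract_triples(sentences, window=3):
    """Collect (actor, location, date) triples that co-occur within
    at most `window` consecutive sentences.

    `sentences` is a hypothetical input format: one list per sentence,
    each holding (entity_type, entity) pairs.
    """
    triples = set()
    for start in range(len(sentences)):
        # Pool the annotations of up to `window` consecutive sentences.
        pooled = {"actor": set(), "location": set(), "date": set()}
        for sentence in sentences[start:start + window]:
            for entity_type, entity in sentence:
                if entity_type in pooled:
                    pooled[entity_type].add(entity)
        # Assumption: a triple pairs one actor, one location, and one date.
        triples.update(
            product(pooled["actor"], pooled["location"], pooled["date"])
        )
    return triples


# Toy document with five sentences; the annotations are invented.
example = [
    [("actor", "Albert Einstein")],
    [("location", "Ulm")],
    [("date", "1879-03-14")],
    [],
    [("actor", "Marie Curie")],
]
triples = extract_triples(example)
```

In this toy input the first three sentences fit into one window, so they yield a single triple. The actor in the last sentence shares a window with the date but with no location, so it produces nothing.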


Event Triples from News Articles

This is a collection of event triples extracted from English news outlets and from other news outlets that publish in English. The collection covers news articles published between 2016-01-01 and 2016-12-31. Entities are of type actor or location and were annotated automatically in the article text with the Stanford NER toolset; dates were annotated using HeidelTime. Triples were constructed from co-occurrences of entities within a window of at most three consecutive sentences.


Networks

Wikipedia Social Network

The Wikipedia social network is a weighted, large-scale network that is constructed from person mentions in the English Wikipedia corpus.

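| Version | Wikipedia     | Files | Number of nodes | Number of edges | Total file size                        |
|---------|---------------|-------|-----------------|-----------------|----------------------------------------|
| 1.0     | EN 2015/01/12 | 4     | approx. 800k    | approx. 70M     | 963 MB compressed, 2.7 GB uncompressed |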


Wikipedia Location Network

The Wikipedia location network is a weighted, undirected large-scale network of roughly 720k nodes and 178M edges that is constructed from toponyms in the English Wikipedia corpus.

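| Version | Wikipedia     | Files | Number of nodes | Number of edges | Total file size                        |
|---------|---------------|-------|-----------------|-----------------|----------------------------------------|
| 1.0     | EN 2015/06/02 | 3     | approx. 720k    | approx. 178M    | 1.4 GB compressed, 6.2 GB uncompressed |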
