Contents
Base datasets
These datasets include information about Named Entities extracted from Wikipedia. Information is available for the classes PERSON, ORGANIZATION, and LOCATION. For each class, two types of files are available:
- information about the entity (NE files)
- links to other openly accessible databases (LINK files)
These files were extracted using the NECKAr tool.
Version 2
Wikidata dump 2017/03/20, NECKAr v1.0
complete download (484.7 MB, 3 files: all entities, all en links, all de links)
Named Entities (NE)
Named Entities | number of entities | details |
Persons (189 MB) | 3,416,903 | more |
Organizations (30 MB) | 1,002,937 | more |
Locations (116 MB) | 4,842,665 | more |
all entities (336 MB) | 9,262,505 |
Links
Links EN | # entries EN | Links DE | # entries DE | details |
Persons_Links (53 MB) | 1,459,196 | Persons_Links (29 MB) | 642,373 | more |
Organizations_Links (22 MB) | 525,353 | Organizations_Links (11 MB) | 211,148 | more |
Locations_Links (38 MB) | 989,767 | Locations_Links (17 MB) | 388,634 | more |
all_Links (107 MB) | 2,974,316 | all_Links (54 MB) | 1,242,155 |
Previous Versions
Previous versions of the base data sets are available here.
Event datasets
Event Triples from Wikipedia
This is a collection of event triples that were extracted from the English Wikipedia dump of May 2015. Entities are of type actor, location, or date. Actors and locations were annotated by matching Wikipedia links to Wikidata items. Dates were annotated using HeidelTime. Triples were constructed from co-occurrences of entities within a window of at most 3 consecutive sentences. An example of an event triple is shown in the following image:
Event Triples from News Articles
This is a collection of event triples that were extracted from English news outlets and news outlets that publish in the English language. The collection covers news articles from the period 2016-01-01 to 2016-12-31. Entities of type actor and location were annotated automatically in the article text with the Stanford NER toolset; dates were annotated using HeidelTime. Triples were constructed from co-occurrences of entities within a window of at most 3 consecutive sentences.
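The window-based construction described for both event collections can be illustrated with a small sketch. This is not the pipeline used to build the datasets: the function name `event_triples`, the input layout (one list of `(kind, value)` annotations per sentence), and the placeholder identifiers are assumptions made for illustration, and "at most 3 consecutive sentences" is read here as "the three mentions span no more than 3 sentences".

```python
from itertools import product


def event_triples(sentences, window=3):
    """Return the set of (actor, location, date) triples whose mentions
    all fall within `window` consecutive sentences.

    `sentences` is a list with one entry per sentence; each entry is a
    list of (kind, value) pairs, where kind is "actor", "location",
    or "date". This layout is hypothetical, not the dataset's format.
    """
    # Group mentions by kind, remembering the sentence index of each one.
    mentions = {"actor": [], "location": [], "date": []}
    for index, sentence in enumerate(sentences):
        for kind, value in sentence:
            mentions[kind].append((index, value))

    triples = set()
    for (i, actor), (j, location), (k, date) in product(
        mentions["actor"], mentions["location"], mentions["date"]
    ):
        # Keep the combination only if its mentions span fewer than
        # `window` sentence boundaries, i.e. at most `window` sentences.
        if max(i, j, k) - min(i, j, k) < window:
            triples.add((actor, location, date))
    return triples


# Placeholder identifiers; real data would carry Wikidata items and dates.
example = [
    [("actor", "actor_A"), ("date", "1990-05-01")],
    [("location", "location_X")],
    [],
    [("actor", "actor_B")],
]
print(event_triples(example))
```

In this example only `actor_A` forms a triple, because `actor_B` appears four sentences after the date mention and therefore falls outside the 3-sentence window.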
Networks
Wikipedia Social Network
The Wikipedia social network is a weighted, large-scale network that is constructed from person mentions in the English Wikipedia corpus.
Version | Wikipedia | files | number of nodes | number of edges | total file size |
1.0 | EN 2015/01/12 | 4 | appr. 800k | appr. 70M | 963 MB compressed, 2.7 GB uncompressed |
Wikipedia Location Network
The Wikipedia location network is a weighted, undirected large-scale network of roughly 720k nodes and 178M edges that is constructed from toponyms in the English Wikipedia corpus.
Version | Wikipedia | files | number of nodes | number of edges | total file size |
1.0 | EN 2015/06/02 | 3 | appr. 720k | appr. 178M | 1.4 GB compressed, 6.2 GB uncompressed |
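To make "weighted, undirected" concrete, the following sketch aggregates entity co-occurrences into edge weights. It is only an illustration: the weighting scheme used for the published networks is not described here, so a plain co-occurrence count per sentence is assumed, and the function name `cooccurrence_network` and the sample toponyms are hypothetical.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_network(sentence_entities):
    """Build an undirected, weighted edge map from per-sentence entity lists.

    Each edge is stored once as an alphabetically sorted pair, so (u, v)
    and (v, u) share one entry. The weight is the number of sentences in
    which both entities occur (an assumed scheme, not the dataset's).
    """
    weights = Counter()
    for entities in sentence_entities:
        # Deduplicate within a sentence and sort so each pair has one key.
        for u, v in combinations(sorted(set(entities)), 2):
            weights[(u, v)] += 1
    return weights


sample = [
    ["Paris", "Lyon"],
    ["Lyon", "Paris", "Nice"],
    ["Nice"],
]
network = cooccurrence_network(sample)
for (u, v), weight in sorted(network.items()):
    print(u, v, weight)
```

Storing each pair in sorted order is one simple way to keep the graph undirected without a graph library; at the scale of the published networks (hundreds of millions of edges) a streaming or disk-backed approach would likely be needed instead.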