The Named Entity Classifier for Wikidata (NECKAr) is a tool to assign entities present in Wikidata to the NE classes Person, Location, and Organization.
Many Information Extraction (IE) tasks, such as Named Entity Recognition (NER) or Event Detection, require background repositories that provide a very simple classification of entities. The predominantly used classes are Location, Person, and Organization. Several knowledge bases offer a very detailed and specific ontology of entities, but these are of limited use to IE approaches.
NECKAr assigns Wikidata entities to the three main NE classes using only information present in Wikidata. The resulting Wikidata NE dataset consists of over 8M classified entities.
Tool
The current version of the NECKAr tool can be downloaded here: NECKAr_v1.0.tar.gz
Requirements
- Python 3
- MongoDB 3.2+
- Python modules as listed in requirements.txt:
  - requests==2.10.0
  - pymongo==3.4.0
Installing, Configuring and Running NECKAr
- Download the tar file and unpack it.
- Download a Wikidata dump.
- Install and run MongoDB.
- Add the path to the dump and details about your database to the file NECKAr.cfg.
- Start one of the startNECKAr scripts.
If you want to start the complete pipeline (import Wikidata dump to the database, extract the Named Entities, and create LOD links) use the script startNeckar.sh. To run the parts separately use the other scripts provided.
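Before starting the pipeline, it can help to confirm that the pinned Python modules from the Requirements list are actually installed. The following is a minimal sketch and not part of NECKAr: the helper names `parse_pin` and `check_environment` are made up for illustration, and the script assumes Python 3.8 or newer for `importlib.metadata`.

```python
from importlib import metadata  # standard library from Python 3.8 on

# Pins copied from the Requirements list above.
PINNED = ["requests==2.10.0", "pymongo==3.4.0"]


def parse_pin(line):
    """Split a 'name==version' line into a (name, version) tuple."""
    name, _, version = line.strip().partition("==")
    return name, version


def check_environment(pins=PINNED):
    """Return a list of readable problems; an empty list means all pins match."""
    problems = []
    for name, wanted in map(parse_pin, pins):
        try:
            found = metadata.version(name)
        except metadata.PackageNotFoundError:
            problems.append(f"{name} is not installed (expected {wanted})")
            continue
        if found != wanted:
            problems.append(f"{name} {found} is installed, expected {wanted}")
    return problems


if __name__ == "__main__":
    for problem in check_environment():
        print("WARNING:", problem)
```

The script only reports mismatches; whether a newer version of a module also works with NECKAr is not something this check can tell.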
Wikidata NE dataset
The Wikidata NE dataset has two parts: the Named Entity files and the link files.
The Named Entity files include the most important information about the entities, whereas the link files contain the links and ids in other databases. The link files are available for English and German. The English link files include only entities that have a page in the English Wikipedia; these entities are linked to DBPedia. The German link files include only entities that have a page in the German Wikipedia; these entities are linked to the German DBPedia.
The datasets can be downloaded from the following tables (all files are json.gz):
Version 2
Wikidata dump 2017/03/20, NECKAr v1.0
http://event.ifi.uni-heidelberg.de/wp-content/uploads/2017/04/WikidataLODLinks_20170320_Persons_NECKAR_1_0.json_.gz
Named Entities | Links en | Links de | number of entities |
Persons (189 MB) | Persons_Links (53 MB) | Persons_Links (29 MB) | 3,416,903 |
Organizations (30 MB) | Organizations_Links (22 MB) | Organizations_Links (11 MB) | 1,002,937 |
Locations (116 MB) | Locations_Links (38 MB) | Locations_Links (17 MB) | 4,842,665 |
all entities (336 MB) | all_Links (107 MB) | all_Links (54 MB) | 9,262,505 |
Version 1
Wikidata dump 2016/12/05, NECKAr v1.0
complete download (484.7 MB, 3 files: all entities, all en links, all de links)
Named Entities | Links en | Links de | number of entities |
Persons (179 MB) | Persons_Links (51 MB) | Persons_Links (28 MB) | 3,322,217 |
Organizations (27 MB) | Organizations_Links (20 MB) | Organizations_Links (10 MB) | 939,840 |
Locations (105 MB) | Locations_Links (35 MB) | Locations_Links (16 MB) | 4,527,356 |
all entities (310 MB) | all_Links (103 MB) | all_Links (52 MB) | 8,789,413 |
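Once downloaded, the json.gz files can be read without unpacking them to disk. The sketch below rests on two assumptions that this page does not confirm: that each file stores one JSON object per line, and that the class is stored under a key named `neClass` (taken from the example table, which may not match the actual key in the files). If a file turns out to be a single JSON array, `json.load` on the opened handle would be needed instead. The function names are illustrative only.

```python
import gzip
import json
from collections import Counter


def iter_entities(path):
    """Yield one dict per non-empty line of a gzip-compressed file.

    Assumes a JSON-lines layout (one object per line).
    """
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


def count_by_class(path, class_key="neClass"):
    """Count entities per class; class_key is an assumed field name."""
    return Counter(
        entity.get(class_key, "UNKNOWN") for entity in iter_entities(path)
    )
```

Called on a downloaded file, for example `count_by_class("all_entities.json.gz")` (a hypothetical local filename), the result can be compared against the "number of entities" column in the tables above.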
Example
The following examples show one entry for each entity type:
neClass | LOC | ORG | PER |
id | Q201117 | Q41187 | Q132345 |
norm name | Toyota | Sony | Shinzo Abe |
description | city in Aichi Prefecture, Japan | conglomerate corporation | Prime Minister |
en Wikipedia | Toyota, Aichi | Sony | Shinzō Abe |

LOC attribute | value | ORG attribute | value | PER attribute | value |
location type | city, settlement | instance of | business enterprise | occupation | politician |
population | 423,343 | CEO | Ryoji Chubachi | gender | male |
continent | Asia | Founder | Masaru Ibuka, Akio Morita | date of birth | 9/21/54 |
country | Japan | Inception | 5/7/46 | date of death | – |
coordinate | 35.083 137.156 | Headquarter | Sony City | alias | Abe Shinzo |
 | | Country | Japan | | |
 | | official website | http://www.sony.net/ | | |
References
If you use NECKAr in your work please cite our paper: