Public knowledge graphs and SPARQL endpoints

Public datasets, SPARQL endpoints, and reusable vocabularies referenced throughout the curriculum. Most of these I’ll query rather than download.

Last updated: Pre-launch, 5/23/26 · Companion files: reading-list.md · tools.md




Public SPARQL endpoints

Wikidata Query Service

The largest publicly-queryable knowledge graph. Crowd-curated; contains structured data about millions of entities — people, places, books, films, scientific concepts.

Why it matters: Wikidata is where to go to practice SPARQL on real data. The query helper UI is forgiving — start by reading and modifying example queries before writing from scratch.

Used in: Module 1 Exercise 1.1; Module 4 federation experiments.
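A minimal sketch of querying the Wikidata Query Service from Python, using only the standard library. The endpoint URL and the house-cat query (P31 = instance of, Q146 = house cat) follow the service's well-known introductory example; the User-Agent string and the helper names (build_request, run_query, simplify) are my own placeholders, not part of any official client.

```python
import json
import urllib.parse
import urllib.request

WDQS_ENDPOINT = "https://query.wikidata.org/sparql"

# Ten items that are instances of (P31) house cat (Q146), with English labels.
# The wd:, wdt:, wikibase: and bd: prefixes are predeclared by the service.
EXAMPLE_QUERY = """
SELECT ?item ?itemLabel WHERE {
  ?item wdt:P31 wd:Q146 .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}
LIMIT 10
"""


def build_request(query: str, endpoint: str = WDQS_ENDPOINT) -> urllib.request.Request:
    """Build a GET request that asks for SPARQL JSON results."""
    url = endpoint + "?" + urllib.parse.urlencode({"query": query})
    headers = {
        "Accept": "application/sparql-results+json",
        # Placeholder value: identify your own script and contact here.
        "User-Agent": "kg-curriculum-example/0.1 (contact: you@example.org)",
    }
    return urllib.request.Request(url, headers=headers)


def simplify(bindings: list) -> list:
    """Flatten SPARQL JSON bindings to plain {variable: value} dicts."""
    return [{var: cell["value"] for var, cell in row.items()} for row in bindings]


def run_query(query: str, timeout: float = 30.0) -> list:
    """Send the query over the network and return simplified rows."""
    with urllib.request.urlopen(build_request(query), timeout=timeout) as resp:
        payload = json.load(resp)
    return simplify(payload["results"]["bindings"])


if __name__ == "__main__":
    for row in run_query(EXAMPLE_QUERY):
        print(row["item"], row.get("itemLabel", ""))
```

Keeping build_request separate from run_query means the URL and headers can be inspected without touching the network, which helps when a query is being rejected and I want to see exactly what was sent.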

Practical tips:

DBpedia

Structured data extracted from Wikipedia. Predates Wikidata; still useful for English-language Wikipedia-anchored queries. Less curated than Wikidata but sometimes faster.

Bio2RDF

Life sciences linked data. Aggregates many biomedical sources (UniProt, KEGG, etc.) into a single SPARQL-queryable graph. Useful as an example of a successful domain-specific semantic web deployment.

LinkedGeoData

Geographic data from OpenStreetMap, available as Linked Data with a SPARQL endpoint.


Reusable vocabularies

When building an ontology, reuse before you define. These are the vocabularies the curriculum points to repeatedly.

schema.org

Originally an SEO vocabulary backed by Google, Bing, Yahoo, and Yandex; now a general-purpose vocabulary for describing things on the web. It has classes for almost every common entity type: Person, Organization, Event, Book, TVEpisode, and so on.

Used in: Modules 1, 2, 4 (Person, Organization, Event, DateTime, TVEpisode).

FOAF (Friend of a Friend)

The classic vocabulary for describing people and their connections. Older than schema.org but still standard for many use cases.

Used in: Module 1 — foaf:Person, foaf:knows.
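To make "reuse before you define" concrete, here is a small sketch that describes two people with FOAF terms and writes the result as N-Triples. The FOAF and RDF namespace IRIs are the published ones; the example.org namespace, the people, and the helper names are hypothetical. In practice a library such as rdflib would handle serialization; this version uses plain strings so it runs with no dependencies.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
FOAF = "http://xmlns.com/foaf/0.1/"
EX = "http://example.org/people/"  # hypothetical namespace for our own data


def iri(value: str) -> str:
    """Wrap an IRI in angle brackets, as N-Triples requires."""
    return f"<{value}>"


def literal(text: str) -> str:
    """Quote a plain string literal, escaping backslash, quote and newline."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")
    return f'"{escaped}"'


# Only the individuals are ours; every class and property comes from FOAF.
TRIPLES = [
    (iri(EX + "alice"), iri(RDF_TYPE), iri(FOAF + "Person")),
    (iri(EX + "alice"), iri(FOAF + "name"), literal("Alice")),
    (iri(EX + "bob"), iri(RDF_TYPE), iri(FOAF + "Person")),
    (iri(EX + "alice"), iri(FOAF + "knows"), iri(EX + "bob")),
]


def to_ntriples(triples) -> str:
    """Serialize (subject, predicate, object) terms, one triple per line."""
    return "\n".join(f"{s} {p} {o} ." for s, p, o in triples) + "\n"


if __name__ == "__main__":
    print(to_ntriples(TRIPLES), end="")
```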

Dublin Core

Bibliographic and resource description metadata: title, creator, date, publisher, source, etc. The vocabulary to reach for when recording metadata about a resource: who made it, when, and where it came from.

Used in: Module 3 reification work — dcterms:source, dcterms:date.
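A sketch of how dcterms:source and dcterms:date can attach to a statement, assuming classic RDF reification (rdf:Statement with rdf:subject, rdf:predicate, rdf:object). The module may use a different pattern such as named graphs or RDF-star; the example.org IRIs, the Literal class, and the reify helper are all hypothetical.

```python
from dataclasses import dataclass

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DCTERMS = "http://purl.org/dc/terms/"
XSD_DATE = "http://www.w3.org/2001/XMLSchema#date"
EX = "http://example.org/"  # hypothetical namespace


@dataclass(frozen=True)
class Literal:
    """A typed literal; plain strings in a triple are treated as IRIs."""
    value: str
    datatype: str


def reify(statement, subject, predicate, obj, source, date):
    """Describe the claim (subject, predicate, obj) as a resource of its own.

    The returned triples talk about the claim; they do not assert it.
    """
    return [
        (statement, RDF + "type", RDF + "Statement"),
        (statement, RDF + "subject", subject),
        (statement, RDF + "predicate", predicate),
        (statement, RDF + "object", obj),
        (statement, DCTERMS + "source", source),
        (statement, DCTERMS + "date", Literal(date, XSD_DATE)),
    ]


if __name__ == "__main__":
    claim = reify(
        EX + "claims/1",
        EX + "characterA",
        EX + "hasRank",
        EX + "rankB",
        source=EX + "sources/guidebook",
        date="2026-01-01",  # illustrative value only
    )
    for triple in claim:
        print(triple)
```

Because the claim gets its own IRI, two sources that disagree can each be recorded as a separate rdf:Statement without either one being asserted as fact.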

SKOS (Simple Knowledge Organization System)

The W3C vocabulary for concept schemes — controlled vocabularies, taxonomies, thesauri. The right choice when you have a hierarchy of concepts that isn’t a strict class hierarchy.

Used in: Module 1 (ESCO skills), Module 3 (alignment via skos:exactMatch, skos:closeMatch).
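One consequence of using SKOS rather than a class hierarchy: skos:broader is not transitive by itself, so "everything above this concept" has to be computed (SKOS defines skos:broaderTransitive for that purpose). A minimal sketch of the traversal follows, with a made-up skills scheme standing in for real ESCO data; the concept names and the function name are hypothetical.

```python
from collections import deque

# Hypothetical concept scheme: each key maps to its direct skos:broader concepts.
# A concept may have several broader concepts, so this is a graph, not a tree.
BROADER = {
    "python": ["programming-languages"],
    "programming-languages": ["software-development"],
    "sparql": ["query-languages", "semantic-web"],
    "query-languages": ["software-development"],
}


def broader_transitive(concept: str, broader: dict) -> list:
    """Return every concept reachable by following broader links, nearest first."""
    seen = []
    queue = deque(broader.get(concept, []))
    while queue:
        current = queue.popleft()
        if current in seen:
            continue  # already visited via another path; also guards against cycles
        seen.append(current)
        queue.extend(broader.get(current, []))
    return seen


if __name__ == "__main__":
    print(broader_transitive("sparql", BROADER))
```

Against a real endpoint the same closure can be expressed in SPARQL with a property path such as skos:broader+, which avoids pulling the whole scheme into memory.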

PROV-O

The W3C provenance ontology. For describing who, when, where, and how an assertion came to exist.

Used in: Module 3 reification project (Naruto provenance for contested canon).

OWL Time

The W3C ontology for temporal concepts: instants, intervals, before/after relationships, durations.

Used in: Module 2 modeling — when modeling characters whose ninja rank changes across arcs.


Linked Open Data Cloud

A visualization of how public RDF datasets link to each other. Useful for orientation: at a glance, it shows the structure of the linked-data ecosystem and which datasets are central (Wikidata, DBpedia, schema.org).

Worth bookmarking. Don’t try to read it linearly; I treat it as a map for when I need to find a dataset in a particular domain.


Domain-specific knowledge graphs

ESCO (European Skills, Competences, Qualifications and Occupations)

A multilingual classification of occupations and skills published as Linked Open Data. Used in my Resume Graph Explorer project.

Used in: Module 1 (Resume Graph RDF slice), Module 3 (skill inference).

FIBO (Financial Industry Business Ontology)

Dean Allemang co-maintains FIBO. It’s an example of a serious, production-grade enterprise ontology — useful to browse for the rigor of its modeling and the discipline of its documentation.

UMLS (Unified Medical Language System)

The medical semantic web’s anchor. Aggregates ~200 biomedical vocabularies. Requires registration to access. Mentioned here as context for the life sciences as the most successful semantic web deployment domain.

Getty Vocabularies

Art and architecture vocabularies as Linked Open Data. Includes the Art and Architecture Thesaurus (AAT), the Union List of Artist Names (ULAN), and the Thesaurus of Geographic Names (TGN).

GeoNames

Geographic database with SPARQL access via various mirrors.


Practice query collections

When I want to learn SPARQL by reading rather than writing, these are the best sources of high-quality example queries.

Wikidata SPARQL query examples

Hundreds of example queries spanning every domain Wikidata covers. Each is well-commented and explains the SPARQL technique it demonstrates.

Starting points particularly useful for this curriculum:

DuCharme’s Learning SPARQL examples

Every example from the book is available as downloadable Turtle and SPARQL files. Run them locally in Fuseki.

Allemang Working Ontologist datasets

The companion datasets and queries for the textbook. Hosted on data.world.


A note on endpoint reliability

Public SPARQL endpoints come and go. Wikidata is reliable; DBpedia is usually reliable; smaller endpoints sometimes go offline. If I build something that depends on a public endpoint, I should plan for it to fail occasionally.

The Module 4 federation experiment surfaces this explicitly: federated queries are powerful and slow, and production systems typically materialize external data locally on a refresh schedule rather than relying on live federation.
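The "materialize locally on a refresh schedule" pattern can be sketched as a small cache that refetches once its copy is older than a set age and falls back to the stale copy when the endpoint is down. This is an illustration under my own assumptions, not a production design: the class name, its parameters, and the in-memory storage are hypothetical, and a real system would persist the data to disk or a local triple store.

```python
import time
from typing import Any, Callable


class MaterializedCache:
    """Keep a local copy of remote query results and refresh it on a schedule."""

    def __init__(
        self,
        fetch: Callable[[], Any],
        max_age_seconds: float,
        clock: Callable[[], float] = time.time,
    ) -> None:
        self._fetch = fetch          # e.g. a function that runs a SPARQL query
        self._max_age = max_age_seconds
        self._clock = clock          # injectable so tests can control time
        self._has_data = False
        self._data: Any = None
        self._fetched_at = 0.0

    def get(self) -> Any:
        """Return fresh data when possible, stale data when the fetch fails."""
        now = self._clock()
        if self._has_data and now - self._fetched_at < self._max_age:
            return self._data
        try:
            self._data = self._fetch()
            self._fetched_at = now
            self._has_data = True
        except Exception:
            # Endpoint unavailable: serve the last good copy if there is one.
            if not self._has_data:
                raise
        return self._data


if __name__ == "__main__":
    # Stand-in fetcher; in real use this would call the public endpoint.
    cache = MaterializedCache(lambda: ["row-1", "row-2"], max_age_seconds=3600)
    print(cache.get())
```

The trade-off is staleness for availability: readers never wait on the remote endpoint after the first successful fetch, but they may see data up to one refresh interval old, or older during an outage.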

