Submodule 4.2 — Federation and deployment

01 · Federation with SERVICE

One query, two endpoints.

SPARQL 1.1 defines the SERVICE keyword for federated queries: send part of a query to a remote SPARQL endpoint and join the results with local data. The promise is compelling: your local Naruto graph joins seamlessly with Wikidata's more than 100 million items. The reality involves latency, rate limits, and fragility, which is why production systems mostly avoid runtime federation.

The basic pattern:

PREFIX wd:  <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX wikibase: <http://wikiba.se/ontology#>
PREFIX bd:  <http://www.bigdata.com/rdf#>
PREFIX naruto: <https://sensemaking-ai.com/ns/naruto#>

# Local data: characters who appear in the Pain's Assault arc
# Wikidata: an image (P18) for the Naruto series item, plus a constant title
SELECT ?characterName ?articleTitle ?image WHERE {
  # Local graph pattern
  ?ninja naruto:canonicalName ?characterName ;
         naruto:appearsInArc   naruto:PainAssaultArc .

  # Federated pattern — sent to Wikidata's endpoint
  SERVICE <https://query.wikidata.org/sparql> {
    wd:Q26530693 wdt:P18 ?image .  # Naruto series item; confirm the Q-number on wikidata.org (the lab queries use wd:Q2009573)
    BIND("Pain's Assault arc" AS ?articleTitle)
    SERVICE wikibase:label {
      bd:serviceParam wikibase:language "en" .
    }
  }
}

The Fuseki endpoint receives the full query, executes the local pattern against local data, sends the SERVICE subquery to Wikidata, and joins the results. Latency is cumulative — local query time plus Wikidata round-trip time, which can be 2–15 seconds on a cold Wikidata query.
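
The join step can be pictured with a small sketch. This is an illustration only, not how Fuseki or Jena implement it: solutions are modeled as plain dictionaries, `join` and `compatible` are hypothetical helpers, and the rows are made-up sample data.

```python
from typing import Dict, List

Solution = Dict[str, str]  # variable name -> value


def compatible(a: Solution, b: Solution) -> bool:
    """Two solutions are compatible if they agree on every shared variable."""
    return all(b[var] == val for var, val in a.items() if var in b)


def join(local: List[Solution], remote: List[Solution]) -> List[Solution]:
    """Nested-loop join of local rows with the rows a SERVICE block returned."""
    return [
        {**left, **right}
        for left in local
        for right in remote
        if compatible(left, right)
    ]


# Hypothetical rows: two local characters and one remote row.
local_rows = [{"characterName": "Naruto Uzumaki"}, {"characterName": "Pain"}]
remote_rows = [{"articleTitle": "Pain's Assault arc"}]

joined = join(local_rows, remote_rows)
for row in joined:
    print(row)
```

Because the two sides share no variables here, every local row pairs with every remote row. The local side cannot finish the join until the remote rows arrive, which is why the two latencies add up.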

The SILENT modifier

If the remote endpoint is unavailable or times out, a plain SERVICE clause fails the entire query. With SERVICE SILENT, a failed remote call is treated as a single solution with no bindings, so the rest of the query continues with the local data and the remote variables stay unbound:

# SILENT: if Wikidata is down, return local results only
SERVICE SILENT <https://query.wikidata.org/sparql> {
  SELECT ?label WHERE {
    wd:Q2009573 rdfs:label ?label .
    FILTER (LANG(?label) = "en")
  }
}

Use SILENT for any remote endpoint you do not control. Wikidata's query service goes down several times per year and rate-limits heavy queries aggressively.
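
A minimal sketch of the difference, assuming a toy evaluator rather than a real SPARQL engine. `eval_service` and `wikidata_down` are hypothetical names, and the remote call is simulated by a function that raises. The key detail is that a failed SILENT service contributes one empty solution, which joins with every local row.

```python
from typing import Callable, Dict, List

Solution = Dict[str, str]


def eval_service(fetch: Callable[[], List[Solution]], silent: bool) -> List[Solution]:
    """Evaluate a SERVICE block; `fetch` stands in for the remote HTTP call."""
    try:
        return fetch()
    except OSError:
        if silent:
            # One solution with no bindings: it joins with every local row.
            return [{}]
        raise


def join(local: List[Solution], remote: List[Solution]) -> List[Solution]:
    """Merge rows that agree on all shared variables."""
    return [
        {**a, **b}
        for a in local
        for b in remote
        if all(b[k] == v for k, v in a.items() if k in b)
    ]


def wikidata_down() -> List[Solution]:
    # ConnectionError is a subclass of OSError.
    raise ConnectionError("simulated outage")


local_rows = [{"arcName": "Pain's Assault"}]
silent_result = join(local_rows, eval_service(wikidata_down, silent=True))
print(silent_result)  # local rows survive; remote variables stay unbound
```

Calling `eval_service(wikidata_down, silent=False)` re-raises the error instead, which corresponds to the whole query failing.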

02 · Federation failure modes

Why production systems mostly avoid it.

Latency is cumulative. A local query that runs in 50ms can take 5 seconds with a Wikidata SERVICE clause because the remote endpoint has its own query execution time plus network round-trip. If the federated query runs at user request time, users notice.
Endpoint availability is not guaranteed. Wikidata times out on queries that take more than 60 seconds. It rate-limits clients. It occasionally goes fully offline for maintenance. Any production feature that requires Wikidata to be up will sometimes fail.
Remote schema changes break queries silently. Wikidata can add, rename, or deprecate properties. A query using wdt:P18 (image) that runs today may return nothing next month if the property mapping changes. Federated queries have no schema contract.
The alternative production pattern. Most systems that depend on external data materialize it locally: a scheduled job fetches the needed triples from Wikidata, stores them in the local triplestore, and queries are entirely local. Refresh frequency depends on how stale the data can be. This is faster, more reliable, and easier to debug.
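
The materialization pattern can be sketched as a local copy plus a refresh function that a scheduled job calls. This is a simplified illustration under stated assumptions: `MaterializedCopy` is a hypothetical class, the fetch function stands in for the real Wikidata request, and triples are held in memory rather than in a triplestore.

```python
from typing import Callable, List, Optional


class MaterializedCopy:
    """Local copy of remote triples, refreshed by a scheduled job."""

    def __init__(self, fetch: Callable[[], List[str]], max_age_seconds: float):
        self._fetch = fetch  # stands in for the remote request
        self._max_age = max_age_seconds
        self._triples: List[str] = []
        self._fetched_at: Optional[float] = None

    def is_stale(self, now: float) -> bool:
        return self._fetched_at is None or now - self._fetched_at > self._max_age

    def refresh_if_due(self, now: float) -> bool:
        """Return True if the copy was replaced during this call."""
        if not self.is_stale(now):
            return False
        try:
            self._triples = self._fetch()
        except OSError:
            return False  # remote is down: keep serving the previous copy
        self._fetched_at = now
        return True

    def triples(self) -> List[str]:
        """Reads are purely local and never touch the remote endpoint."""
        return list(self._triples)


calls = []


def fake_fetch() -> List[str]:
    calls.append(1)
    return ["<urn:example:s> <urn:example:p> <urn:example:o> ."]


local_copy = MaterializedCopy(fake_fetch, max_age_seconds=7 * 24 * 3600)
first = local_copy.refresh_if_due(now=0.0)               # no copy yet: fetch
second = local_copy.refresh_if_due(now=3600.0)           # one hour later: fresh
third = local_copy.refresh_if_due(now=8 * 24 * 3600.0)   # eight days later: refetch
print(first, second, third, len(calls))
```

The design choice worth noting is that a remote outage only delays a refresh; it never reaches the read path.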

When federation is the right choice

Federation earns its cost when: (a) the data is too large to materialize locally, (b) it must be live/real-time, or (c) you do not have the rights to cache it. ESCO is actually a good candidate — the full dataset is large, it is updated by the EU periodically, and you might prefer to hit the live endpoint rather than maintain a stale copy. For the curriculum, the ESCO stubs in resume-001.ttl are good enough for learning; for a production skill-inference pipeline, federated ESCO makes sense.

03 · Triplestore comparison

Five viable options, each with rough edges.

The Module 4 README names this as a pain point: "no Postgres-equivalent." Choosing a triplestore means accepting specific tradeoffs. Here are the five most relevant for a solo practitioner or small team deploying on EC2:

Triplestore	Best for	Operations burden	Notable limitation
Oxigraph	Simple deployments, solo ops, RDF-star support	Very low — single Rust binary, no JVM	Smaller community; fewer enterprise features; no SPARQL Update federation
Apache Jena Fuseki	Learning, local dev, reference implementation	Medium — requires JVM, more config files	JVM memory footprint; not designed for high-concurrent-write production loads
GraphDB (free tier)	OWL reasoning in production, RDF-star, full-text search	Medium — Docker image, proprietary config	Free tier limits; requires registration; proprietary extensions create lock-in
Stardog	Enterprise OWL reasoning, data virtualization	Medium — cloud or self-hosted; good tooling	Not free for production; pricing can be high at scale
Blazegraph	Wikidata's stack, proven at massive scale	Medium: JVM, complex configuration	No longer actively developed; the Wikidata Query Service still runs on it while Wikimedia evaluates replacement backends

For Exercise 4.1, the Module 4 README recommends Oxigraph for first deployment: single binary, no JVM, simple systemd service file, RDF-star support out of the box. Fuseki is the fallback if you prefer the reference implementation you have been using throughout the curriculum.

04 · Ontology versioning

Semver doesn't quite map.

The Module 4 README calls ontology versioning "unsolved." Standard software version semantics (major.minor.patch) do not map cleanly onto ontology changes:

Adding a new class or property is usually non-breaking — existing data still validates.
Adding a required restriction is potentially breaking — data that was valid before may now fail SHACL validation.
Removing a class or property is breaking: any data using the removed term now refers to something the ontology no longer defines.
Changing a property's domain or range is breaking — existing data may violate the new constraint.
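
One possible way to turn the four rules above into a version bump is sketched below. It is an assumption, not an established convention: `Change` and `next_version` are hypothetical names, and the sketch conservatively treats "potentially breaking" changes as breaking.

```python
from enum import Enum
from typing import List


class Change(Enum):
    ADD_TERM = "add class or property"
    ADD_REQUIRED_RESTRICTION = "add required restriction"
    REMOVE_TERM = "remove class or property"
    CHANGE_DOMAIN_OR_RANGE = "change domain or range"


# Assumption: anything that can invalidate existing data counts as breaking.
BREAKING = {
    Change.ADD_REQUIRED_RESTRICTION,
    Change.REMOVE_TERM,
    Change.CHANGE_DOMAIN_OR_RANGE,
}


def next_version(current: str, changes: List[Change]) -> str:
    """Suggest the next major.minor.patch string for a set of changes."""
    major, minor, patch = (int(part) for part in current.split("."))
    if any(change in BREAKING for change in changes):
        return f"{major + 1}.0.0"
    if Change.ADD_TERM in changes:
        return f"{major}.{minor + 1}.0"
    return f"{major}.{minor}.{patch + 1}"


print(next_version("1.0.0", [Change.ADD_TERM]))
print(next_version("1.0.0", [Change.ADD_TERM, Change.REMOVE_TERM]))
```

The mapping is lossy by design: a restriction that happens to break no existing data still bumps the major version, which is one reason semver fits ontologies awkwardly.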

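OWL 2 provides four annotation properties for versioning: owl:versionInfo, owl:priorVersion, owl:backwardCompatibleWith, and owl:incompatibleWith. The first three in use: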

@prefix owl: <http://www.w3.org/2002/07/owl#> .

<https://sensemaking-ai.com/ns/naruto>
    a owl:Ontology ;
    owl:versionInfo           "1.0.0" ;
    owl:priorVersion          <https://sensemaking-ai.com/ns/naruto/starter> ;
    owl:backwardCompatibleWith <https://sensemaking-ai.com/ns/naruto/starter> .

# Declare incompatibility explicitly when breaking changes are made
# owl:incompatibleWith <...prior-version>

The practical recommendation: use `owl:versionInfo` for display purposes and commit-level version tracking. Treat the Git history as your actual version control — every commit is a point in time you can diff and restore. OWL versioning annotations are supplementary metadata, not a substitute for version control on the file.

05 · EC2 deployment guide — Exercise 4.1

Add a triplestore to your existing EC2 infrastructure.

These steps assume you already have an EC2 instance with nginx and systemd, following the same pattern as your existing services.

Oxigraph deployment steps

1 Install Oxigraph on the EC2 instance

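SSH into your EC2 instance. Download the latest Oxigraph release binary for Linux x86_64 from the GitHub releases page. Asset file names on that page can include the version and target, so if the URL below returns 404, copy the exact asset name from the page. As of mid-2026, the binary is named oxigraph.

curl -fLO https://github.com/oxigraph/oxigraph/releases/latest/download/oxigraph_linux_x86_64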
chmod +x oxigraph_linux_x86_64
sudo mv oxigraph_linux_x86_64 /usr/local/bin/oxigraph
oxigraph --version

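Verify the binary runs. Create the data directory (www-data exists by default on Debian and Ubuntu; on other distributions such as Amazon Linux, substitute a service user that exists there):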

sudo mkdir -p /var/lib/oxigraph/naruto
sudo chown www-data:www-data /var/lib/oxigraph/naruto

2 Create the systemd service file

Model after your existing service files. Create /etc/systemd/system/oxigraph.service:

[Unit]
Description=Oxigraph SPARQL endpoint — Naruto KG
After=network.target

[Service]
Type=simple
User=www-data
ExecStart=/usr/local/bin/oxigraph serve \
  --location /var/lib/oxigraph/naruto \
  --bind 127.0.0.1:7878
Restart=always
RestartSec=5

[Install]
WantedBy=multi-user.target

Enable and start:

sudo systemctl daemon-reload
sudo systemctl enable oxigraph
sudo systemctl start oxigraph
sudo systemctl status oxigraph

Verify Oxigraph is listening: curl http://127.0.0.1:7878/ should return an HTML page.

3 Configure nginx reverse proxy with CORS

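Create /etc/nginx/sites-available/sparql.barbhs.com. The CORS headers are required for browser-side SPARQL queries from the Naruto KG Explorer. This location block proxies every Oxigraph route, including the write routes /update and /store, so restrict those before making the hostname public: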

server {
    server_name sparql.barbhs.com;

    location / {
        proxy_pass http://127.0.0.1:7878;
        proxy_set_header Host $host;

        # CORS for browser SPARQL clients
        add_header 'Access-Control-Allow-Origin' '*' always;
        add_header 'Access-Control-Allow-Methods' 'GET, POST, OPTIONS' always;
        add_header 'Access-Control-Allow-Headers'
            'Accept, Content-Type, Authorization' always;

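        if ($request_method = OPTIONS) {
            # add_header here replaces the location-level headers, so repeat them
            add_header 'Access-Control-Allow-Origin' '*' always;
            add_header 'Access-Control-Allow-Methods' 'GET, POST, OPTIONS' always;
            add_header 'Access-Control-Allow-Headers'
                'Accept, Content-Type, Authorization' always;
            add_header 'Access-Control-Max-Age' 1728000;
            add_header 'Content-Type' 'text/plain; charset=UTF-8';
            add_header 'Content-Length' 0;
            return 204;
        }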
    }

    # SSL — certbot will add this block automatically
}

sudo ln -s /etc/nginx/sites-available/sparql.barbhs.com \
           /etc/nginx/sites-enabled/
sudo nginx -t && sudo systemctl reload nginx
sudo certbot --nginx -d sparql.barbhs.com

4 Load the Naruto ontology into the deployed store

Upload the TTL file to the EC2 instance and load it via the Oxigraph HTTP API or CLI:

scp modules/02-modeling/artifacts/naruto-ontology/naruto-ontology-1.0.0.ttl \
    ec2-user@your-instance:/tmp/

# Load via HTTP into the default graph (run on the instance, where the file now lives)
curl -f -X PUT "http://127.0.0.1:7878/store?default" \
  -H "Content-Type: text/turtle" \
  --data-binary @/tmp/naruto-ontology-1.0.0.ttl

# Verify with a test query
curl "https://sparql.barbhs.com/query?query=SELECT+%2A+WHERE+%7B+%3Fs+%3Fp+%3Fo+%7D+LIMIT+10" \
  -H "Accept: application/sparql-results+json"

If the query returns results, the store is loaded and queryable.
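
The percent-encoded string in the verification URL does not need to be written by hand. As a small sketch, Python's standard library produces the same encoding; `query_url` is a hypothetical helper name and the endpoint is the one used in the steps above.

```python
from urllib.parse import urlencode

ENDPOINT = "https://sparql.barbhs.com/query"  # endpoint from the steps above


def query_url(sparql: str, endpoint: str = ENDPOINT) -> str:
    """Build a GET URL for a SPARQL query, percent-encoding the query text."""
    return f"{endpoint}?{urlencode({'query': sparql})}"


test_url = query_url("SELECT * WHERE { ?s ?p ?o } LIMIT 10")
print(test_url)
```

The printed URL matches the one passed to curl in the verification step, so the same helper can generate URLs for other test queries.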

5 Add to Uptime Kuma and document the deployment

In your Uptime Kuma instance, add a new HTTP monitor for https://sparql.barbhs.com/query?query=ASK%20%7B%20%3Fs%20%3Fp%20%3Fo%20%7D. The endpoint answers 200 OK whenever it is up, and the ASK result is true only when at least one triple is loaded, so use a keyword monitor that checks for true in the response body if you want the check to cover the data as well. This is a better health check than just pinging the root path.
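
If you ever script this check yourself, the decision logic is small. The sketch below assumes the response was requested as application/sparql-results+json, where an ASK result carries a top-level "boolean" field; `is_healthy` is a hypothetical helper, and the HTTP call itself is left out so the logic can be tested offline.

```python
import json


def is_healthy(status_code: int, body: str) -> bool:
    """Healthy only if the endpoint answered 200 and the ASK result is true."""
    if status_code != 200:
        return False
    try:
        return json.loads(body).get("boolean") is True
    except (json.JSONDecodeError, AttributeError):
        # Not JSON, or JSON that is not an object: treat as unhealthy.
        return False


print(is_healthy(200, '{"head": {}, "boolean": true}'))   # up, data loaded
print(is_healthy(200, '{"head": {}, "boolean": false}'))  # up, store empty
print(is_healthy(502, "Bad Gateway"))                     # proxy or store down
```

The middle case is the one a status-only monitor misses: the endpoint is reachable but the store has no data.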

Create modules/04-shipping/docs/triplestore-deployment.md documenting: Oxigraph version installed, systemd config decisions, nginx config, how to reload data, how to back up the data directory, and any EC2-specific surprises. This becomes the infrastructure case study artifact.

06 · Federation query lab

Three queries — Exercise 4.2 patterns.

Run these from your local Fuseki against the public Wikidata endpoint. They demonstrate the federation patterns for Exercise 4.2 and directly expose the failure modes described in Section 2.

f01

Join local Naruto arc data with Wikidata's article for the Naruto series.

Pattern: SERVICE federation · local + remote join · latency measurement

PREFIX wd:      <http://www.wikidata.org/entity/>
PREFIX wdt:     <http://www.wikidata.org/prop/direct/>
PREFIX wikibase: <http://wikiba.se/ontology#>
PREFIX bd:      <http://www.bigdata.com/rdf#>
PREFIX schema:  <https://schema.org/>
PREFIX rdfs:    <http://www.w3.org/2000/01/rdf-schema#>
PREFIX naruto:  <https://sensemaking-ai.com/ns/naruto#>

# Local: which arcs are in the Naruto ontology?
# Remote: look up the Naruto series item on Wikidata for metadata
SELECT ?arcName ?wikidataLabel WHERE {
  ?arc a naruto:Arc ;
       schema:name ?arcName .
  FILTER (LANG(?arcName) = "en")

  SERVICE SILENT <https://query.wikidata.org/sparql> {
    SELECT ?wikidataLabel WHERE {
      wd:Q2009573 rdfs:label ?wikidataLabel .
      FILTER (LANG(?wikidataLabel) = "en")
    }
  }
}
ORDER BY ?arcName

What to observe

Note the query execution time (visible in Fuseki's bottom bar). The first run is typically 2–8 seconds; subsequent runs may be faster if Wikidata caches the response. This is the federation latency cost — a pure local query over the same arcs would run in under 50ms.

If Wikidata is unavailable, SERVICE SILENT returns the local data with ?wikidataLabel unbound rather than failing. Without SILENT, the whole query would error. Note the production implication: SERVICE without SILENT makes your endpoint reliability dependent on Wikidata's.
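
To record the latency rather than read it off the UI, a small timing harness is enough. This is a sketch under assumptions: `time_runs` and `summarize` are hypothetical helpers, and `fake_query` stands in for whatever function actually sends the query to Fuseki.

```python
import time
from statistics import median
from typing import Callable, Dict, List


def time_runs(run_query: Callable[[], object], repeats: int = 5) -> List[float]:
    """Run the query `repeats` times and return elapsed seconds per run."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_query()
        timings.append(time.perf_counter() - start)
    return timings


def summarize(timings: List[float]) -> Dict[str, float]:
    """Keep the first run separate: it is the cold, uncached measurement."""
    return {"first": timings[0], "median": median(timings), "max": max(timings)}


def fake_query() -> None:
    time.sleep(0.01)  # placeholder for the real HTTP request


timings = time_runs(fake_query, repeats=3)
summary = summarize(timings)
print(summary)
```

Running the same harness once against the federated query and once against a purely local one gives the comparison Exercise 4.2 asks you to document.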

f02

Retrieve your father's published patents from Wikidata and join with local portfolio data.

Pattern: real-world federation · Wikidata as external data source · the Exercise 4.2 example

PREFIX wd:      <http://www.wikidata.org/entity/>
PREFIX wdt:     <http://www.wikidata.org/prop/direct/>
PREFIX wikibase: <http://wikiba.se/ontology#>
PREFIX bd:      <http://www.bigdata.com/rdf#>

# Exercise 4.2 specifies using your father's published patents
# as an example of real federated data retrieval.
# Replace REPLACE_WITH_INVENTOR_Q_ID with the actual Wikidata ID for the person.
# Search at wikidata.org to find the Q-number first.

SELECT ?patent ?patentLabel ?date WHERE {
  SERVICE <https://query.wikidata.org/sparql> {
    ?patent wdt:P31  wd:Q253623 ;  # instance of: patent
            wdt:P61  wd:REPLACE_WITH_INVENTOR_Q_ID .  # discoverer or inventor (P178 is "developer")
    OPTIONAL { ?patent wdt:P577 ?date . }
    SERVICE wikibase:label {
      bd:serviceParam wikibase:language "en" .
    }
  }
}
ORDER BY DESC(?date)

How to adapt this query

1. Search Wikidata for the inventor's name to find their Q-number. Replace REPLACE_WITH_INVENTOR_Q_ID with the actual Q-number (e.g., wd:Q12345678).

2. Run the query in the Wikidata Query Service directly first (query.wikidata.org) to verify it returns results. Then run it through local Fuseki with the SERVICE wrapper to observe the latency comparison.

3. Note the Exercise 4.2 requirement: document the latency, failure modes (try with SILENT vs without), and what happens if Wikidata returns an unexpected schema change.
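
If you generate the query from a script, it helps to validate the Q-number before substituting it. The sketch below is illustrative: `fill_inventor` is a hypothetical helper and `TEMPLATE` is an abbreviated stand-in for the f02 query text, since only the placeholder matters here.

```python
import re

PLACEHOLDER = "REPLACE_WITH_INVENTOR_Q_ID"

# Abbreviated stand-in for the f02 query; only the placeholder matters here.
TEMPLATE = """SELECT ?patent WHERE {
  ?patent ?link wd:REPLACE_WITH_INVENTOR_Q_ID .
}"""


def fill_inventor(template: str, q_id: str) -> str:
    """Substitute a validated Wikidata Q-number for the placeholder."""
    if not re.fullmatch(r"Q[1-9][0-9]*", q_id):
        raise ValueError(f"not a Wikidata Q-number: {q_id!r}")
    if PLACEHOLDER not in template:
        raise ValueError("template has no inventor placeholder")
    return template.replace(PLACEHOLDER, q_id)


filled = fill_inventor(TEMPLATE, "Q12345678")  # the example Q-number from step 1
print(filled)
```

Rejecting anything that is not a plain Q-number also keeps stray text from being pasted into the query.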

f03

Simulate what a materialization refresh job would retrieve — all Naruto character Wikidata items.

Pattern: data materialization planning · what to prefetch to avoid runtime federation

PREFIX wd:      <http://www.wikidata.org/entity/>
PREFIX wdt:     <http://www.wikidata.org/prop/direct/>
PREFIX wikibase: <http://wikiba.se/ontology#>
PREFIX bd:      <http://www.bigdata.com/rdf#>
PREFIX schema:  <https://schema.org/>

# Query Wikidata for items tagged as Naruto characters.
# This is what a nightly materialization job would run to
# prefetch metadata rather than federating at query time.
SELECT ?character ?characterLabel ?gender ?voiceActor WHERE {
  SERVICE <https://query.wikidata.org/sparql> {
    ?character wdt:P31    wd:Q15711870 .  # instance of: character class; confirm the Q-number on wikidata.org
    ?character wdt:P1080  wd:Q2009573 .   # from fictional universe: Naruto
    OPTIONAL { ?character wdt:P21 ?gender . }
    OPTIONAL { ?character wdt:P725 ?voiceActor . }
    SERVICE wikibase:label {
      bd:serviceParam wikibase:language "en" .
    }
  }
}
ORDER BY ?characterLabel
LIMIT 20

What this demonstrates

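This query is slow because Wikidata has to match every character in the Naruto universe before the results can be sorted. Note that LIMIT 20 sits outside the SERVICE block, so Fuseki applies it locally after Wikidata has returned every row; to cap the remote work, move the pattern into a subquery with its own LIMIT inside the SERVICE block. Slow, timeout-prone remote queries are why the Module 4 README says "most production systems materialize external data locally and refresh on a schedule."

The practical takeaway: run this query once, convert the results to triples (a CONSTRUCT query returns them directly), load them into your deployed Oxigraph store with a timestamp, and serve them locally from then on. Refresh on a schedule, nightly or weekly depending on how stale the data can be. This eliminates the runtime federation dependency while keeping the data reasonably current.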
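
A minimal sketch of the load step, assuming the job receives SPARQL JSON results for the f03 variables. `to_insert_data` and `literal` are hypothetical helpers, dcterms:modified is an assumed choice for the timestamp property, and the sample row uses a made-up example.org IRI rather than real Wikidata output.

```python
from datetime import datetime, timezone

RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"
DCT_MODIFIED = "http://purl.org/dc/terms/modified"  # assumed timestamp property
XSD_DATETIME = "http://www.w3.org/2001/XMLSchema#dateTime"


def literal(value: str, lang: str = "") -> str:
    """Escape a string as a SPARQL literal, with an optional language tag."""
    escaped = (
        value.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )
    return f'"{escaped}"@{lang}' if lang else f'"{escaped}"'


def to_insert_data(results: dict, fetched_at: datetime) -> str:
    """Build one INSERT DATA update from SPARQL JSON SELECT results."""
    stamp = f'"{fetched_at.isoformat()}"^^<{XSD_DATETIME}>'
    lines = []
    for row in results["results"]["bindings"]:
        subject = f"<{row['character']['value']}>"
        label = row.get("characterLabel")
        if label is not None:
            text = literal(label["value"], label.get("xml:lang", ""))
            lines.append(f"  {subject} <{RDFS_LABEL}> {text} .")
        lines.append(f"  {subject} <{DCT_MODIFIED}> {stamp} .")
    return "INSERT DATA {\n" + "\n".join(lines) + "\n}"


# Made-up sample row in the SPARQL JSON results shape.
sample = {"results": {"bindings": [{
    "character": {"type": "uri", "value": "http://example.org/character/1"},
    "characterLabel": {"type": "literal", "xml:lang": "en", "value": 'The "Example"'},
}]}}

update = to_insert_data(sample, datetime(2025, 1, 1, tzinfo=timezone.utc))
print(update)
```

The resulting string can be POSTed to the store's update endpoint. Only the label and the timestamp are written here; gender and voice actor would follow the same pattern with properties of your choosing.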

07 · Resources

Reading, tools, and next steps.

Primary reading

DuCharme — Ch 5–6

Chapter 5 covers federation in depth including the failure modes. Chapter 6 covers real-world applications — read both before starting Exercise 4.2.

Triplestore

Oxigraph GitHub

The recommended deployment triplestore. Check the releases page for the latest Linux binary. The README covers the HTTP API — this is how you load data, run queries, and manage the store from the CLI.

Federation spec

W3C SPARQL 1.1 Federated Query

The full SERVICE keyword specification. Section 4 covers SERVICE SILENT semantics. Read before Exercise 4.2 to understand the behavior you are observing.

Prior submodule

Submodule 4.1 — SPARQL UPDATE

The UPDATE operations needed to manage data lifecycle in your deployed triplestore. The INSERT DATA and COPY operations from 4.1 are what you will use to load and refresh data after deployment.

Next submodule

Submodule 4.3 — LLM + KG integration

The capstone: TwinKit Semantic v2.0, hybrid retrieval, the Naruto KG Explorer, and the honest evaluation framework. Requires the deployment from Exercise 4.1 to be complete first.

Deployment docs

triplestore-deployment.md (you write this — Exercise 4.1 deliverable)

The artifact for Exercise 4.1: document every decision made during the EC2 deployment. This becomes the infrastructure case study and is part of the Module 4 README checklist.