LinkedDataCampVienna2009/DatasetDynamics

From Linked Data Camp

Problem

Data within Linked Data sets (LDS) change in the course of time. The frequency of changes depends on the nature of the LDS; publication databases, for instance, are likely to change less frequently than LDS exposing weather information. In any case, clients having some dependency on data in a remote data set need to be notified about changes.

We have identified three major use cases:

  • UC1: a client mirrors/replicates (parts of) an LDS such as DBpedia and needs to be informed about changes on the triple level, i.e., which triples have been added / removed
  • UC2: a client (a LDS publisher) has link dependencies to another LDS. In order to keep links up-to-date, one needs to get informed about changes on a resource level, i.e., changes affecting the resource identification or description
  • UC3: a client (an application) uses one or more LDS and needs to know what has changed (sort of smart cache)

For providing such a notification infrastructure we need the following ingredients:

  1. a vocabulary for representing information about the dynamics of a LDS (Dataset Dynamics Vocabulary)
  2. a vocabulary for expressing WHAT has changed and HOW it has changed, if necessary
  3. a protocol for communicating changes to the client

Proposed Solutions

Dataset Dynamics Vocabulary

Dady is a Dataset Dynamics Discovery (D3) vocabulary, which can represent information about the regularity (regular, irregular) and frequency (no, low, mid, high) of updates and provide a link to the update notification source URI. It is designed to be used with voiD. Here is an example:

:ds	rdf:type void:Dataset ;
	foaf:homepage <http://example.org/ds/> ;
	dady:dynamics [
		rdf:type dady:RegularUpdates ;
		rdf:type dady:LowFrequentUpdates ;
		dady:update [
			rdf:type dady:AtomUpdateSource ;
			dady:notification <http://example.org/updates.atom> ;
		] ;
	] .
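
A consumer that has discovered such a description could use the frequency class to decide how often to poll the notification source. The following Python sketch is only an illustration: the concrete intervals and the `poll_interval` helper are assumptions and not part of dady, and the classes are written as prefixed names rather than full URIs.

```python
from datetime import timedelta

# Hypothetical mapping from dady frequency classes to polling intervals.
# The interval values are assumptions chosen for illustration only.
POLL_INTERVALS = {
    "dady:NoUpdates": None,  # never poll
    "dady:LowFrequentUpdates": timedelta(days=7),
    "dady:MidFrequentUpdates": timedelta(days=1),
    "dady:HighFrequentUpdates": timedelta(minutes=15),
}


def poll_interval(dynamics_types, default=timedelta(days=1)):
    """Return the interval for the first known frequency class.

    `dynamics_types` is the list of rdf:type values found on the
    dady:dynamics node. Unknown classes (e.g. regularity classes)
    are skipped; `default` is used if no frequency class is present.
    """
    for type_name in dynamics_types:
        if type_name in POLL_INTERVALS:
            return POLL_INTERVALS[type_name]
    return default


# The types from the example description above.
interval = poll_interval(["dady:RegularUpdates", "dady:LowFrequentUpdates"])
```

With the assumed mapping, the example dataset would be checked once a week.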

Triple-Level change notification protocol

This approach should work for both use cases. The protocol uses Atom to inform the client that data has been added or removed. It has two modes, short list and full mode, which one service could offer side by side. Some open issues remain to be discussed.

Feed parameters: short list or full mode, time period for updates, consolidated or raw changes
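
What "consolidated" means is still open. One possible reading is that only the most recent change per triple within the requested period is reported. The Python sketch below implements that reading; the `Change` record and the `consolidate` function are hypothetical names introduced for illustration.

```python
from typing import Iterable, NamedTuple


class Change(NamedTuple):
    timestamp: str  # UTC timestamp as used in Atom, e.g. "2009-11-30T14:10:00Z"
    op: str         # "data-added" or "data-removed"
    triple: tuple   # (subject, predicate, object)


def consolidate(changes: Iterable[Change]) -> list:
    """Keep only the most recent change for each distinct triple."""
    latest = {}
    # Timestamps in this fixed UTC format sort correctly as strings.
    for change in sorted(changes, key=lambda c: c.timestamp):
        latest[change.triple] = change  # later changes overwrite earlier ones
    return sorted(latest.values(), key=lambda c: c.timestamp)


TITLE = "http://purl.org/dc/elements/1.1/title"
THING = "http://example.com/res#thing"

raw = [
    Change("2009-11-30T14:10:00Z", "data-added", (THING, TITLE, "New Title")),
    Change("2009-11-30T15:00:00Z", "data-removed", (THING, TITLE, "New Title")),
    Change("2009-11-30T16:00:00Z", "data-added", (THING, TITLE, "New Title")),
    Change("2009-11-30T17:35:00Z", "data-removed", (THING, TITLE, "Old Title")),
]
consolidated = consolidate(raw)
```

Here the three raw changes to the "New Title" triple collapse into the last one, so a client asking for consolidated changes would receive two entries instead of four.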

Short list

The short list only offers information about the resources that have changed, but does not offer detailed information about the actual changes.

For example:

<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">

  <title>LDC09 demo dataset changes</title>
  <link href="http://localhost:8888/dady-demo/publisher/"/>
  <updated>2009-11-30T14:10:00Z</updated>
  <author>
    <name>Michael Hausenblas</name>
  </author>
  <id>12345</id>

  <entry>
    <title>change 123456780</title>
    <link href="http://example.org/resource/A" rel="data-added" />
    <id>123456780</id>
    <updated>2009-11-30T14:10:00Z</updated>
    <summary>added some new resources and removed some old ones</summary>
  </entry>

  <entry>
    <title>change 1234567891</title>
    <link href="http://example.org/resource/B" rel="data-removed" />
    <id>1234567891</id>
    <updated>2009-11-30T17:35:00Z</updated>
    <summary>removed some data</summary>
  </entry>

</feed>


The link element indicates which resources (http://example.org/resource/A, http://example.org/resource/B) are affected. The 'rel' attribute (data-added / data-removed) indicates the nature of an atomic change, i.e., whether data has been added or removed.
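
A client could read such a short list with any Atom-aware XML parser. The following Python sketch uses only the standard library; the function name `changed_resources` is an assumption, and the embedded feed is a trimmed copy of the example above.

```python
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"

# Trimmed copy of the short-list example feed.
FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <title>LDC09 demo dataset changes</title>
  <id>12345</id>
  <updated>2009-11-30T14:10:00Z</updated>
  <entry>
    <title>change 123456780</title>
    <link href="http://example.org/resource/A" rel="data-added" />
    <id>123456780</id>
    <updated>2009-11-30T14:10:00Z</updated>
  </entry>
  <entry>
    <title>change 1234567891</title>
    <link href="http://example.org/resource/B" rel="data-removed" />
    <id>1234567891</id>
    <updated>2009-11-30T17:35:00Z</updated>
  </entry>
</feed>"""


def changed_resources(feed_xml):
    """Return (resource URI, change type, timestamp) for every change link."""
    root = ET.fromstring(feed_xml)
    changes = []
    for entry in root.findall(f"{ATOM}entry"):
        updated = entry.findtext(f"{ATOM}updated")
        for link in entry.findall(f"{ATOM}link"):
            rel = link.get("rel")
            # Ignore links that do not describe a change.
            if rel in ("data-added", "data-removed"):
                changes.append((link.get("href"), rel, updated))
    return changes


changes = changed_resources(FEED)
```

A link-maintaining publisher (UC2) could then re-check only the resources listed in `changes` instead of the whole remote data set.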

Full mode

The full mode contains all changes, including the changed triples. An open question is whether changes should be consolidated (e.g., if a triple has changed often in the period of interest, only the most recent triple is provided).

For example:

<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">

  <title>LDC09 demo dataset changes</title>
  <link href="http://localhost:8888/dady-demo/publisher/"/>
  <updated>2009-11-30T14:10:00Z</updated>
  <author>
    <name>Michael Hausenblas</name>
  </author>
  <id>12345</id>

  <entry>
    <title>change 123456780</title>
    <link href="http://example.org/resource/A" rel="data-added" />
    <id>123456780</id>
    <updated>2009-11-30T14:10:00Z</updated>
    <rdf:Statement>
      <rdf:subject rdf:resource="http://example.com/res#thing"/>
      <rdf:predicate rdf:resource="http://purl.org/dc/elements/1.1/title"/>
      <rdf:object>New Title</rdf:object>
    </rdf:Statement>
    <rdf:Statement>
      <rdf:subject rdf:resource="http://example.com/res2#thing"/>
      <rdf:predicate rdf:resource="http://purl.org/dc/elements/1.1/title"/>
      <rdf:object>New Title for res2</rdf:object>
    </rdf:Statement>
    <summary>added some new resources and removed some old ones</summary>
  </entry>

  <entry>
    <title>change 1234567891</title>
    <link href="http://example.org/resource/B" rel="data-removed" />
    <id>1234567891</id>
    <updated>2009-11-30T17:35:00Z</updated>
    <rdf:Statement>
      <rdf:subject rdf:resource="http://example.com/res#thing"/>
      <rdf:predicate rdf:resource="http://purl.org/dc/elements/1.1/title"/>
      <rdf:object>Old Title</rdf:object>
    </rdf:Statement>
    <summary>removed some data</summary>
  </entry>

</feed>


The link element indicates which resources (http://example.org/resource/A, http://example.org/resource/B) are affected. The 'rel' attribute (data-added / data-removed) indicates the nature of an atomic change, i.e., whether data has been added or removed. In full mode, each entry additionally carries the affected triples as rdf:Statement elements.
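
For the mirroring use case (UC1), a client could apply a full-mode feed directly to its local copy. The Python sketch below is a minimal illustration under several assumptions: the mirror is a plain set of (subject, predicate, object) tuples, the feed declares the rdf namespace, each entry has exactly one change link, and literals and resource objects are not distinguished. The function names are hypothetical.

```python
import xml.etree.ElementTree as ET

NS = {
    "atom": "http://www.w3.org/2005/Atom",
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
}
RDF_RESOURCE = "{http://www.w3.org/1999/02/22-rdf-syntax-ns#}resource"
TITLE = "http://purl.org/dc/elements/1.1/title"

# Trimmed copy of the full-mode example feed.
FEED = """<feed xmlns="http://www.w3.org/2005/Atom"
      xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <entry>
    <link href="http://example.org/resource/A" rel="data-added" />
    <updated>2009-11-30T14:10:00Z</updated>
    <rdf:Statement>
      <rdf:subject rdf:resource="http://example.com/res#thing"/>
      <rdf:predicate rdf:resource="http://purl.org/dc/elements/1.1/title"/>
      <rdf:object>New Title</rdf:object>
    </rdf:Statement>
  </entry>
  <entry>
    <link href="http://example.org/resource/B" rel="data-removed" />
    <updated>2009-11-30T17:35:00Z</updated>
    <rdf:Statement>
      <rdf:subject rdf:resource="http://example.com/res#thing"/>
      <rdf:predicate rdf:resource="http://purl.org/dc/elements/1.1/title"/>
      <rdf:object>Old Title</rdf:object>
    </rdf:Statement>
  </entry>
</feed>"""


def statement_to_triple(stmt):
    """Convert one reified rdf:Statement element into a tuple."""
    subject = stmt.find("rdf:subject", NS).get(RDF_RESOURCE)
    predicate = stmt.find("rdf:predicate", NS).get(RDF_RESOURCE)
    obj = stmt.find("rdf:object", NS)
    # Simplification: a resource object and a literal are both kept as str.
    return (subject, predicate, obj.get(RDF_RESOURCE) or obj.text)


def apply_feed(feed_xml, mirror):
    """Apply all entries to `mirror` (a set of triples) in time order."""
    root = ET.fromstring(feed_xml)
    entries = sorted(
        root.findall("atom:entry", NS),
        key=lambda e: e.findtext("atom:updated", namespaces=NS),
    )
    for entry in entries:
        rel = entry.find("atom:link", NS).get("rel")
        triples = {statement_to_triple(s) for s in entry.findall("rdf:Statement", NS)}
        if rel == "data-added":
            mirror |= triples
        elif rel == "data-removed":
            mirror -= triples
    return mirror


mirror = {("http://example.com/res#thing", TITLE, "Old Title")}
apply_feed(FEED, mirror)
```

After the call, the mirror holds the new title and no longer holds the old one. Because triples have no external identifiers, the feed has to ship each complete triple, which is the trade-off discussed in the conclusions.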

Summary / Conclusions Day 1

Common agreement: there is a need to inform clients about resource changes. This problem involves several dimensions:

  1. information on the dynamics of a data source (e.g., how frequently does it change)
  2. how to model dynamics / changes in data sets, and which vocabularies to use in the end (some already exist)
  3. how to transport change information to the client (notification/subscription, etc.)

ad 1.) makes sense if you do not have a subscription approach but a notification approach, and therefore need to know how frequently to check a source for updates. Dady, as an extension of voiD, is a possible solution.

ad 2.) the question is whether change notification should happen on a low (triple) level or on a higher (event) level. Problem with the triple level: how can a triple be identified externally? If there is no way to address triples, the complete triples / triple diffs have to be transported. A higher-level model would provide detailed information on WHAT has changed in a resource and HOW (e.g., removed, moved from A to B, split, merged, etc.)

ad 3.) the idea was to provide both notification (PULL) and subscription (PUSH). Notification would use known feed protocols (RSS/Atom) provided by the RDF store / data set provider, and may be limited to a certain time frame and certain filter criteria (URI filters). For push, existing protocols still need to be investigated.
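
The time-frame and URI-filter restriction mentioned under ad 3.) could be sketched as follows in Python. The parameter names `since`, `until` and `uri_prefix` are assumptions, since no concrete feed parameters have been agreed on, and the changes are assumed to be (resource URI, change type, timestamp) tuples.

```python
from datetime import datetime, timezone


def parse_atom_time(value):
    """Parse a UTC timestamp of the form used in the example feeds."""
    parsed = datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ")
    return parsed.replace(tzinfo=timezone.utc)


def select_changes(changes, since=None, until=None, uri_prefix=None):
    """Keep changes inside [since, until] whose URI starts with uri_prefix."""
    selected = []
    for href, rel, updated in changes:
        when = parse_atom_time(updated)
        if since is not None and when < since:
            continue
        if until is not None and when > until:
            continue
        if uri_prefix is not None and not href.startswith(uri_prefix):
            continue
        selected.append((href, rel, updated))
    return selected


all_changes = [
    ("http://example.org/resource/A", "data-added", "2009-11-30T14:10:00Z"),
    ("http://example.org/resource/B", "data-removed", "2009-11-30T17:35:00Z"),
    ("http://example.org/other/C", "data-added", "2009-11-30T18:00:00Z"),
]
recent = select_changes(
    all_changes,
    since=datetime(2009, 11, 30, 15, 0, tzinfo=timezone.utc),
    uri_prefix="http://example.org/resource/",
)
```

Only the change to resource B passes both the time and the URI filter in this example.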


TODOS for Day 2:

  • set up some basic real-world use cases (e.g., taken from Wikipedia/DBpedia)
  • see how existing vocabularies / protocols can fulfill these requirements
  • define a vocabulary / protocol draft based on existing solutions


Simple Demo (by LiDRC)

Goal: develop a simple demo covering two aspects:

  • discovery (how does the consumer learn about the change/update mechanism) ... voiD+dady
  • notification (how does the consumer learn about changes) ... extended Atom


Dady voc

Documentation

see http://code.google.com/p/dady/wiki/Demos

Setup

Publisher

  • Juergen - dataset change manager (DCM)
    • input: dataset URI, poll interval
    • output: Atom feed with changes (link rel='data-added', link rel='data-removed')
  • Michael - voiD+dady

Consumer

  • Michael - dataset change watch-dog (DCW)
    • input: voiD URI
    • output: live report on changes (HTML/jQuery)
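
One way the dataset change manager could derive its output is to compare two successive snapshots of the data set taken one poll interval apart. The Python sketch below assumes that snapshots are available as sets of (subject, predicate, object) tuples and reports changes per subject, as in the short list. It is not the demo's actual implementation; the function names, the feed id and the feed title are hypothetical.

```python
import xml.etree.ElementTree as ET

ATOM_NS = "http://www.w3.org/2005/Atom"


def diff_snapshots(old, new):
    """Return (added, removed) triples between two snapshots."""
    return new - old, old - new


def build_feed(added, removed, feed_id, updated):
    """Serialize subject-level changes as a short-list Atom feed."""
    ET.register_namespace("", ATOM_NS)  # write Atom as the default namespace

    def tag(name):
        return f"{{{ATOM_NS}}}{name}"

    feed = ET.Element(tag("feed"))
    ET.SubElement(feed, tag("title")).text = "dataset changes"
    ET.SubElement(feed, tag("id")).text = feed_id
    ET.SubElement(feed, tag("updated")).text = updated
    for rel, triples in (("data-added", added), ("data-removed", removed)):
        # One entry per affected subject, in a stable order.
        for subject in sorted({s for s, _, _ in triples}):
            entry = ET.SubElement(feed, tag("entry"))
            ET.SubElement(entry, tag("title")).text = f"{rel}: {subject}"
            ET.SubElement(entry, tag("link"), href=subject, rel=rel)
            ET.SubElement(entry, tag("id")).text = f"{feed_id}#{rel}-{subject}"
            ET.SubElement(entry, tag("updated")).text = updated
    return ET.tostring(feed, encoding="unicode")


TITLE = "http://purl.org/dc/elements/1.1/title"
old = {
    ("http://example.org/resource/A", TITLE, "Old Title"),
    ("http://example.org/resource/B", TITLE, "B"),
}
new = {
    ("http://example.org/resource/A", TITLE, "New Title"),
    ("http://example.org/resource/C", TITLE, "C"),
}
added, removed = diff_snapshots(old, new)
feed_xml = build_feed(added, removed, "http://example.org/changes", "2009-11-30T14:10:00Z")
```

Note that a changed title shows up as one removal and one addition for the same resource, which is why resource A appears under both link relations.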


Discussion Day 1

Starting point: http://esw.w3.org/topic/DatasetDynamics


Involved in the discussion:

  • A. Langegger
  • R. Cyganiak
  • N. Popitsch
  • L. Dodds
  • B. Haslhofer
  • W. Halb
  • C. Ruiz
  • D. Koller (@dakoller)


A major point that was brought up was that we need a model of what can actually happen (events?!) to an identifier. So basically this could be a kind of lifecycle model of URIs (was also discussed to some extent on the mailing list as "URI history"). Some possible events that can happen to URIs:

  • create
  • remove (?)
  • "move" (i.e., "same" data published under a different URI)
  • split/merge

A good starting point could be to further analyze what happens in DBpedia when Wikipedia is updated (analysis already started in the context of DSNotify): pages are renamed or split, disambiguation pages and redirects are introduced, etc.

  • How to describe the dynamics of datasets?
    • A first analysis of DBpedia showed that a lot of changes of various kinds take place. Many Wikipedia articles are created/removed/renamed. The update frequencies of articles (representations of DBpedia URIs) also vary a lot: in some months there are only a few changes, then suddenly a lot of modifications by many different users.
    • Some dataset providers have these data available, most do not.
  • Do we need new vocabularies to describe datasets in general or should we extend existing ones?
    • We need a list of all vocabularies that can be used to describe dataset properties (like voiD, SCOVO, Talis Changesets, DSNotify eventsets, dady, ...) as started at [1]
  • Is the "event" approach the right one?
    • Events that occur in a dataset are detected by some actor and communicated to subscribers
      • there are multiple well-known models how to do this. Maybe too early to decide which way to go
    • When you know what events happened in your dataset then you can also describe its dynamics
  • What granularity?
    • The simplest useful event could be that a dataset was updated, so the subscribers can react
    • We could define these on a document level (has a particular RDF document changed, was it removed, is it published under a different URI? ...)
    • We could log events based on triple modifications
  • How could one represent such URI lifecycles?
    • Are version control systems applicable to this problem? Could you store your URIs in a SVN server?


  • Should notification be implemented on a triple store level? Can a protocol then send out notifications on a triple level?
    • Problem: triples do not have external identifiers

Vocabularies and protocols

  • Dady: http://purl.org/NET/dady
  • Triplify: http://triplify.org/vocabulary/update
  • Talis Changeset: http://vocab.org/changeset/schema.html, http://n2.talis.com/wiki/Changesets

Classes

Dady:

dady:UpdateDynamics
(subclasses)
   dady:UpdateRegularity
   dady:IrregularUpdates
   dady:RegularUpdates

dady:UpdateFrequency
(subclasses)
   dady:NoUpdates
   dady:LowFrequentUpdates
   dady:MidFrequentUpdates
   dady:HighFrequentUpdates

dady:UpdateSource
(subclasses)
   dady:AtomUpdateSource
   dady:ChangeSetUpdateSource
   dady:TriplifyUpdateSource

Triplify:

update:UpdateCollection
update:Update
update:Deletion

Talis Changeset (example):

<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" 
    xmlns:cs="http://purl.org/vocab/changeset/schema#">
  <cs:ChangeSet rdf:about="http://example.com/changesets#change2">
    <cs:precedingChangeSet rdf:resource="http://example.com/res#change1"/>
    <cs:subjectOfChange rdf:resource="http://example.com/res#thing"/>
    <cs:createdDate>2006-01-01T00:00:00Z</cs:createdDate>
    <cs:creatorName>Anne Onymous</cs:creatorName>
    <cs:changeReason>Change of title</cs:changeReason>
    <cs:removal>
      <rdf:Statement>
        <rdf:subject rdf:resource="http://example.com/res#thing"/>
        <rdf:predicate rdf:resource="http://purl.org/dc/elements/1.1/title"/>
        <rdf:object>Original Title</rdf:object>
      </rdf:Statement>
    </cs:removal>
    <cs:addition>
      <rdf:Statement>
        <rdf:subject rdf:resource="http://example.com/res#thing"/>
        <rdf:predicate rdf:resource="http://purl.org/dc/elements/1.1/title"/>
        <rdf:object>New Title</rdf:object>
      </rdf:Statement>
    </cs:addition>
  </cs:ChangeSet>
</rdf:RDF>

Attributes

Dady:

dady:dynamics	a rdf:Property ; 
	rdfs:label "has dynamics" ;
	rdfs:domain void:Dataset ; 
	rdfs:range dady:UpdateDynamics .
	
dady:update	a rdf:Property ; 
	rdfs:label "has update source" ;
	rdfs:domain dady:Dynamics ;
	rdfs:range dady:UpdateSource .

Triplify:

update:updatedResource
update:updatedAt
update:updatedBy

Protocol

  • Dady: Atom
  • Talis Changeset: The Talis Platform supports modification of graphs using the Changeset Protocol, which uses HTTP to convey Changesets