LinkedDataCampVienna2009/DatasetDynamics
From Linked Data Camp
Problem
Data within Linked Data sets (LDS) changes over time. The frequency of changes depends on the nature of the LDS; publication databases, for instance, are likely to change less frequently than LDS exposing weather information. In any case, clients that depend on data in a remote data set need to be notified about changes.
We have identified the following major use cases:
- UC1: a client mirrors / replicates (parts of) an LDS such as DBpedia and needs to be informed about changes on the triple level, i.e., which triples have been added / removed
- UC2: a client (an LDS publisher) has link dependencies to another LDS. In order to keep links up to date, it needs to be informed about changes on a resource level, i.e., changes affecting the resource identification or description
- UC3: a client (an application) uses one or more LDS and needs to know what has changed (sort of smart cache)
For providing such a notification infrastructure we need the following ingredients:
- a vocabulary for representing information about the dynamics of a LDS (Dataset Dynamics Vocabulary)
- a vocabulary for expressing WHAT has changed and HOW it has changed, if necessary
- a protocol for communicating changes to the client
Proposed Solutions
Dataset Dynamics Vocabulary
Dady is a Dataset Dynamics Discovery (D3) vocabulary, which can represent information about the regularity (regular, irregular) and frequency (no, low, mid, high) of updates and provide a link to the update notification source URI. It is designed to be used with voiD. Here is an example:
:ds rdf:type void:Dataset ;
    foaf:homepage <http://example.org/ds/> ;
    dady:dynamics [
        rdf:type dady:RegularUpdates ;
        rdf:type dady:LowFrequentUpdates ;
        dady:update [
            rdf:type dady:AtomUpdateSource ;
            dady:notification <http://example.org/updates.atom> ;
        ] ;
    ] .
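A consumer could use the frequency class found in such a description to decide how often to check the update source. The following is a minimal sketch of that idea; the class names come from the vocabulary, but the concrete durations and the function name `poll_interval` are assumptions made for illustration only.

```python
from datetime import timedelta

# Hypothetical mapping from dady frequency classes to polling intervals.
# The class names are taken from the vocabulary; the durations are assumptions.
POLL_INTERVALS = {
    "dady:NoUpdates": None,  # nothing to poll for
    "dady:LowFrequentUpdates": timedelta(days=7),
    "dady:MidFrequentUpdates": timedelta(days=1),
    "dady:HighFrequentUpdates": timedelta(minutes=15),
}


def poll_interval(dynamics_types):
    """Return the shortest interval implied by the given rdf:type values, or None."""
    intervals = [
        POLL_INTERVALS[t]
        for t in dynamics_types
        if POLL_INTERVALS.get(t) is not None
    ]
    return min(intervals) if intervals else None


# The rdf:type values of the dady:dynamics node in the example above.
interval = poll_interval(["dady:RegularUpdates", "dady:LowFrequentUpdates"])
```

Types that say nothing about frequency, such as dady:RegularUpdates, are simply ignored by this sketch.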
Triple-Level change notification protocol
This approach should work for both use cases. The protocol uses Atom to inform the client that data has been added or removed. There are two modes, short list and full mode, and a single service could offer both. Some issues are still open for discussion.
Feed parameters: short list or full mode, time period for updates, consolidated or raw changes
Short list
The short list only offers information about the resources that have changed, but does not offer detailed information about the actual changes.
For example:
<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
  <title>LDC09 demo dataset changes</title>
  <link href="http://localhost:8888/dady-demo/publisher/"/>
  <updated>2009-11-30T14:10:00Z</updated>
  <author>
    <name>Michael Hausenblas</name>
  </author>
  <id>12345</id>
  <entry>
    <title>change 123456780</title>
    <link href="http://example.org/resource/A" rel="data-added" />
    <id>123456780</id>
    <updated>2009-11-30T14:10:00Z</updated>
    <summary>added some new resources and removed some old ones</summary>
  </entry>
  <entry>
    <title>change 1234567891</title>
    <link href="http://example.org/resource/B" rel="data-removed" />
    <id>1234567891</id>
    <updated>2009-11-30T17:35:00Z</updated>
    <summary>removed some data</summary>
  </entry>
</feed>
The link element indicates which resources (http://example.org/resource/A, http://example.org/resource/B) are affected. The 'rel' attribute (data-added / data-removed) indicates the nature of an atomic change, i.e., whether data has been added or removed.
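A client could read such a short list with a standard XML parser. The sketch below uses Python's `xml.etree.ElementTree`; the function name `read_short_list` and the choice to return plain tuples are assumptions for illustration, not part of the proposal.

```python
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"
CHANGE_RELS = {"data-added", "data-removed"}


def read_short_list(feed_xml):
    """Return (resource URI, rel, updated) for every change link in the feed."""
    root = ET.fromstring(feed_xml)
    changes = []
    for entry in root.findall(ATOM + "entry"):
        updated = entry.findtext(ATOM + "updated")
        for link in entry.findall(ATOM + "link"):
            rel = link.get("rel")
            # Ignore ordinary Atom links; only the two change relations matter here.
            if rel in CHANGE_RELS:
                changes.append((link.get("href"), rel, updated))
    return changes


# Trimmed version of the example feed above.
SAMPLE = """<feed xmlns="http://www.w3.org/2005/Atom">
  <title>LDC09 demo dataset changes</title>
  <link href="http://localhost:8888/dady-demo/publisher/"/>
  <entry>
    <title>change 123456780</title>
    <link href="http://example.org/resource/A" rel="data-added" />
    <id>123456780</id>
    <updated>2009-11-30T14:10:00Z</updated>
  </entry>
  <entry>
    <title>change 1234567891</title>
    <link href="http://example.org/resource/B" rel="data-removed" />
    <id>1234567891</id>
    <updated>2009-11-30T17:35:00Z</updated>
  </entry>
</feed>"""

for uri, rel, updated in read_short_list(SAMPLE):
    print(updated, rel, uri)
```

Since the short list names only the affected resources, a client would still have to dereference each URI to learn what actually changed.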
Full mode
The full mode contains all changes including the changed triples. An open question is whether changes should be consolidated, e.g., if a triple has changed several times in the period of interest, only its most recent state is provided.
For example:
<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom"
      xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <title>LDC09 demo dataset changes</title>
  <link href="http://localhost:8888/dady-demo/publisher/"/>
  <updated>2009-11-30T14:10:00Z</updated>
  <author>
    <name>Michael Hausenblas</name>
  </author>
  <id>12345</id>
  <entry>
    <title>change 123456780</title>
    <link href="http://example.org/resource/A" rel="data-added" />
    <id>123456780</id>
    <updated>2009-11-30T14:10:00Z</updated>
    <rdf:Statement>
      <rdf:subject rdf:resource="http://example.com/res#thing"/>
      <rdf:predicate rdf:resource="http://purl.org/dc/elements/1.1/title"/>
      <rdf:object>New Title</rdf:object>
    </rdf:Statement>
    <rdf:Statement>
      <rdf:subject rdf:resource="http://example.com/res2#thing"/>
      <rdf:predicate rdf:resource="http://purl.org/dc/elements/1.1/title"/>
      <rdf:object>New Title for res2</rdf:object>
    </rdf:Statement>
    <summary>added some new resources and removed some old ones</summary>
  </entry>
  <entry>
    <title>change 1234567891</title>
    <link href="http://example.org/resource/B" rel="data-removed" />
    <id>1234567891</id>
    <updated>2009-11-30T17:35:00Z</updated>
    <rdf:Statement>
      <rdf:subject rdf:resource="http://example.com/res#thing"/>
      <rdf:predicate rdf:resource="http://purl.org/dc/elements/1.1/title"/>
      <rdf:object>Old Title</rdf:object>
    </rdf:Statement>
    <summary>removed some data</summary>
  </entry>
</feed>
As in the short list, the link element indicates which resources (http://example.org/resource/A, http://example.org/resource/B) are affected, and the 'rel' attribute (data-added / data-removed) indicates the nature of an atomic change, i.e., whether data has been added or removed. In addition, each entry carries the changed triples as rdf:Statement elements.
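For UC1, a mirroring client needs the triples themselves. The sketch below collects them from a full-mode feed. It assumes that the 'rel' value of an entry's link decides whether that entry's statements count as additions or removals, which the examples suggest but the proposal does not state explicitly. The function names are hypothetical.

```python
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"
RDF = "{http://www.w3.org/1999/02/22-rdf-syntax-ns#}"


def read_statement(stmt):
    """Turn one rdf:Statement element into a (subject, predicate, object) tuple."""
    subject = stmt.find(RDF + "subject").get(RDF + "resource")
    predicate = stmt.find(RDF + "predicate").get(RDF + "resource")
    obj = stmt.find(RDF + "object")
    # The object is either a resource reference or a plain literal.
    value = obj.get(RDF + "resource") or (obj.text or "")
    return (subject, predicate, value)


def read_full_mode(feed_xml):
    """Return (added, removed) sets of triples found in a full-mode feed."""
    root = ET.fromstring(feed_xml)
    added, removed = set(), set()
    for entry in root.findall(ATOM + "entry"):
        rels = {link.get("rel") for link in entry.findall(ATOM + "link")}
        triples = {read_statement(s) for s in entry.findall(RDF + "Statement")}
        # Assumption: the entry's rel decides how its statements are interpreted.
        if "data-added" in rels:
            added |= triples
        if "data-removed" in rels:
            removed |= triples
    return added, removed


TITLE = "http://purl.org/dc/elements/1.1/title"
SAMPLE = """<feed xmlns="http://www.w3.org/2005/Atom"
      xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <entry>
    <link href="http://example.org/resource/A" rel="data-added" />
    <id>123456780</id>
    <rdf:Statement>
      <rdf:subject rdf:resource="http://example.com/res#thing"/>
      <rdf:predicate rdf:resource="http://purl.org/dc/elements/1.1/title"/>
      <rdf:object>New Title</rdf:object>
    </rdf:Statement>
  </entry>
  <entry>
    <link href="http://example.org/resource/B" rel="data-removed" />
    <id>1234567891</id>
    <rdf:Statement>
      <rdf:subject rdf:resource="http://example.com/res#thing"/>
      <rdf:predicate rdf:resource="http://purl.org/dc/elements/1.1/title"/>
      <rdf:object>Old Title</rdf:object>
    </rdf:Statement>
  </entry>
</feed>"""

added, removed = read_full_mode(SAMPLE)
```

Note that the parser only resolves the `rdf:` prefix if the feed declares the RDF namespace on the feed element or on the statements.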
Summary / Conclusions Day 1
Common agreement: there is a need to inform clients about resource changes. This problem involves several dimensions:
1.) information on the dynamics of a data source (e.g., how frequently it changes)
2.) how to model dynamics / changes in data sets, and which vocabularies to use (some already exist)
3.) how to transport change information to the client (notification/subscription, etc.)
ad 1.) This makes sense if you use a notification approach rather than a subscription approach and need to know how often to check a source for updates. Dady, as an extension of voiD, is a possible solution.
ad 2.) The question is whether change notification should happen on a low (triple) level or a higher (event) level. The problem with the triple level is how to identify a triple externally: if there is no way to address triples, you need to transport complete triples / triple diffs. A higher-level model would provide detailed information on WHAT has changed in a resource and HOW (e.g., removed, moved from A to B, split, merged, etc.).
ad 3.) The idea was to provide both notification (PULL) and subscription (PUSH). Notification would use known feed protocols (RSS/Atom) provided by the RDF store / data set provider, and may be limited to a certain time frame and certain filter criteria (URI filters). For push, existing protocols still need to be investigated.
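The time frame and URI filter mentioned for the pull case could look roughly like the sketch below. It assumes changes are available as (URI, rel, updated) tuples with Atom-style UTC timestamps; the parameter names `since`, `until` and `uri_prefix` are invented for illustration and are not part of any agreed protocol.

```python
from datetime import datetime, timezone


def parse_atom_time(value):
    """Parse a UTC Atom timestamp such as 2009-11-30T14:10:00Z."""
    parsed = datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ")
    return parsed.replace(tzinfo=timezone.utc)


def filter_changes(changes, since=None, until=None, uri_prefix=None):
    """Keep the changes inside the time frame and under the URI prefix."""
    selected = []
    for uri, rel, updated in changes:
        when = parse_atom_time(updated)
        if since is not None and when < since:
            continue
        if until is not None and when > until:
            continue
        if uri_prefix is not None and not uri.startswith(uri_prefix):
            continue
        selected.append((uri, rel, updated))
    return selected


CHANGES = [
    ("http://example.org/resource/A", "data-added", "2009-11-30T14:10:00Z"),
    ("http://example.org/resource/B", "data-removed", "2009-11-30T17:35:00Z"),
]

# Only changes from 15:00 UTC onwards.
afternoon = datetime(2009, 11, 30, 15, 0, tzinfo=timezone.utc)
recent = filter_changes(CHANGES, since=afternoon)
```

A publisher could apply the same filter on the server side before serializing the feed.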
TODOs for Day 2:
- set up some basic real-world use cases (e.g., taken from Wikipedia/DBpedia)
- see how existing vocabularies / protocols can fulfill these requirements
- define a vocabulary / protocol draft based on existing solutions
Simple Demo (by LiDRC)
Goal: develop a simple demo covering two aspects:
- discovery (how does the consumer learn about the change/update mechanism) ... voiD+dady
- notification (how does the consumer learn about changes) ... extended Atom
Dady voc
- dady voc
- dady voc visualisation 220k PNG
Documentation
see http://code.google.com/p/dady/wiki/Demos
Setup
Publisher
- Juergen - dataset change manager (DCM)
- input: dataset URI, poll interval
- output: Atom feed with changes (link rel='data-added', link rel='data-removed')
- Michael - voiD+dady
Consumer
- Michael - dataset change watch-dog (DCW)
- input: voiD URI
- output: live report on changes (HTML/jQuery)
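The consumer side can be pictured as a small component that remembers which feed entries it has already reported. The following is only a sketch of that idea, not the demo's actual code: the class name `ChangeWatchdog` is hypothetical, fetching the feed is left to the caller, and the HTML/jQuery report is replaced by a plain return value.

```python
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"


class ChangeWatchdog:
    """Report each Atom entry only once across repeated polls."""

    def __init__(self):
        self.seen = set()

    def new_entries(self, feed_xml):
        """Return (id, title) for entries not reported by an earlier poll."""
        root = ET.fromstring(feed_xml)
        fresh = []
        for entry in root.findall(ATOM + "entry"):
            entry_id = entry.findtext(ATOM + "id")
            if entry_id in self.seen:
                continue
            self.seen.add(entry_id)
            fresh.append((entry_id, entry.findtext(ATOM + "title")))
        return fresh


FIRST_POLL = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry><title>change 123456780</title><id>123456780</id></entry>
</feed>"""

SECOND_POLL = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry><title>change 123456780</title><id>123456780</id></entry>
  <entry><title>change 1234567891</title><id>1234567891</id></entry>
</feed>"""

watchdog = ChangeWatchdog()
first = watchdog.new_entries(FIRST_POLL)
second = watchdog.new_entries(SECOND_POLL)
```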
Discussion Day 1
Starting point: http://esw.w3.org/topic/DatasetDynamics
Involved in the discussion:
- A. Langegger
- R. Cyganiak
- N. Popitsch
- L. Dodds
- B. Haslhofer
- W. Halb
- C. Ruiz
- D. Koller (@dakoller)
A major point that was brought up was that we need a model of what can actually happen (events?!) to an identifier. So basically this could be a kind of lifecycle model of URIs (was also discussed to some extent on the mailing list as "URI history").
Some possible events that can happen to URIs:
- create
- remove (?)
- "move" (i.e., "same" data published under a different URI)
- split/merge
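To make the lifecycle idea concrete, the events listed above could be modelled as simple records. The sketch below is one possible shape, with all type and field names invented for illustration. It also shows how a link maintainer (UC2) might follow "move" events to find the current URI; split and merge are left unresolved because they need a policy the discussion has not settled.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Tuple


class UriEventKind(Enum):
    CREATE = "create"
    REMOVE = "remove"
    MOVE = "move"
    SPLIT = "split"
    MERGE = "merge"


@dataclass(frozen=True)
class UriEvent:
    kind: UriEventKind
    sources: Tuple[str, ...]       # URIs affected by the event
    targets: Tuple[str, ...] = ()  # URIs that result from it, if any


def follow_moves(uri, events):
    """Follow 'move' events in order; return the current URI, or None if removed."""
    current = uri
    for event in events:
        if current not in event.sources:
            continue
        if event.kind is UriEventKind.MOVE:
            current = event.targets[0]
        elif event.kind is UriEventKind.REMOVE:
            return None
        # SPLIT and MERGE are not resolved in this sketch.
    return current


HISTORY = [
    UriEvent(UriEventKind.MOVE, ("http://example.org/resource/A",),
             ("http://example.org/resource/A2",)),
    UriEvent(UriEventKind.MOVE, ("http://example.org/resource/A2",),
             ("http://example.org/resource/A3",)),
    UriEvent(UriEventKind.REMOVE, ("http://example.org/resource/B",)),
]

current_a = follow_moves("http://example.org/resource/A", HISTORY)
current_b = follow_moves("http://example.org/resource/B", HISTORY)
```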
A good starting point could be to further analyze what happens in DBpedia when Wikipedia is updated (analysis already started in the context of DSNotify): pages are renamed, split, disambiguation pages, redirects, etc.
- How to describe the dynamics of datasets?
- A first analysis of DBpedia showed that a lot of changes of various kinds take place. Many Wikipedia articles are created/removed/renamed. The update frequencies of articles (representations of DBpedia URIs) also vary a lot: in some months there are only a few changes, then suddenly a lot of modifications by many different users.
- Some dataset providers have these data available, most do not.
- Do we need new vocabularies to describe datasets in general or should we extend existing ones?
- We need a list of all vocabularies that can be used to describe dataset properties (like voiD, SCOVO, Talis Changesets, DSNotify eventsets, dady, ...) as started at [1]
- Is the "event" approach the right one?
- Events that occur in a dataset are detected by some actor and communicated to subscribers
- there are multiple well-known models how to do this. Maybe too early to decide which way to go
- When you know what events happened in your dataset then you can also describe its dynamics
- What granularity?
- The simplest useful event could be that a dataset was updated, so the subscribers can react
- We could define these on a document level (has a particular RDF document changed, was it removed, is it published under a different URI? ...)
- We could log events based on triple modifications
- How could one represent such URI lifecycles?
- Are version control systems applicable to this problem? Could you store your URIs in an SVN server?
- Should notification be implemented on a triple store level? Can a protocol then send out notifications on a triple level?
- Problem: triples do not have external identifiers.
Vocabularies and protocols
| | Dady, http://purl.org/NET/dady | Triplify, http://triplify.org/vocabulary/update | Talis Changeset, http://vocab.org/changeset/schema.html, http://n2.talis.com/wiki/Changesets |
|---|---|---|---|
| Classes | dady:UpdateDynamics (subclasses) dady:UpdateRegularity dady:IrregularUpdates dady:RegularUpdates dady:UpdateFrequency (subclasses) dady:NoUpdates dady:LowFrequentUpdates dady:MidFrequentUpdates dady:HighFrequentUpdates dady:UpdateSource (subclasses) dady:AtomUpdateSource dady:ChangeSetUpdateSource dady:TriplifyUpdateSource | update:UpdateCollection update:Update update:Deletion | (see the RDF/XML example below the table) |
| Attributes | dady:dynamics a rdf:Property ; rdfs:label "has dynamics" ; rdfs:domain void:Dataset ; rdfs:range dady:UpdateDynamics . dady:update a rdf:Property ; rdfs:label "has update source" ; rdfs:domain dady:Dynamics ; rdfs:range dady:UpdateSource . | update:updatedResource update:updatedAt update:updatedBy | |
| Protocol | Atom | | The Talis Platform supports modification of graphs using the Changeset Protocol, which uses HTTP to convey Changesets |

Talis Changeset example:
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:cs="http://purl.org/vocab/changeset/schema#">
  <cs:ChangeSet rdf:about="http://example.com/changesets#change2">
    <cs:precedingChangeSet rdf:resource="http://example.com/res#change1"/>
    <cs:subjectOfChange rdf:resource="http://example.com/res#thing"/>
    <cs:createdDate>2006-01-01T00:00:00Z</cs:createdDate>
    <cs:creatorName>Anne Onymous</cs:creatorName>
    <cs:changeReason>Change of title</cs:changeReason>
    <cs:removal>
      <rdf:Statement>
        <rdf:subject rdf:resource="http://example.com/res#thing"/>
        <rdf:predicate rdf:resource="http://purl.org/dc/elements/1.1/title"/>
        <rdf:object>Original Title</rdf:object>
      </rdf:Statement>
    </cs:removal>
    <cs:addition>
      <rdf:Statement>
        <rdf:subject rdf:resource="http://example.com/res#thing"/>
        <rdf:predicate rdf:resource="http://purl.org/dc/elements/1.1/title"/>
        <rdf:object>New Title</rdf:object>
      </rdf:Statement>
    </cs:addition>
  </cs:ChangeSet>
</rdf:RDF>
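A client that mirrors data could apply such a changeset to its local copy by deleting the removals and inserting the additions. The sketch below does this for a graph held as a Python set of triples. It is an illustration under simplifying assumptions (objects are treated as resource URIs or plain literals, and cs:precedingChangeSet ordering is ignored); it is not the Talis Platform's implementation, and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET

RDF = "{http://www.w3.org/1999/02/22-rdf-syntax-ns#}"
CS = "{http://purl.org/vocab/changeset/schema#}"


def statements(changeset, kind):
    """Yield the triples reified under cs:removal or cs:addition."""
    for wrapper in changeset.findall(CS + kind):
        for stmt in wrapper.findall(RDF + "Statement"):
            subject = stmt.find(RDF + "subject").get(RDF + "resource")
            predicate = stmt.find(RDF + "predicate").get(RDF + "resource")
            obj = stmt.find(RDF + "object")
            value = obj.get(RDF + "resource") or (obj.text or "")
            yield (subject, predicate, value)


def apply_changeset(graph, changeset_xml):
    """Return a new set of triples with removals deleted and additions inserted."""
    root = ET.fromstring(changeset_xml)
    result = set(graph)
    for changeset in root.iter(CS + "ChangeSet"):
        result -= set(statements(changeset, "removal"))
        result |= set(statements(changeset, "addition"))
    return result


THING = "http://example.com/res#thing"
TITLE = "http://purl.org/dc/elements/1.1/title"

# Shortened version of the changeset example above.
CHANGESET = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:cs="http://purl.org/vocab/changeset/schema#">
  <cs:ChangeSet rdf:about="http://example.com/changesets#change2">
    <cs:subjectOfChange rdf:resource="http://example.com/res#thing"/>
    <cs:removal>
      <rdf:Statement>
        <rdf:subject rdf:resource="http://example.com/res#thing"/>
        <rdf:predicate rdf:resource="http://purl.org/dc/elements/1.1/title"/>
        <rdf:object>Original Title</rdf:object>
      </rdf:Statement>
    </cs:removal>
    <cs:addition>
      <rdf:Statement>
        <rdf:subject rdf:resource="http://example.com/res#thing"/>
        <rdf:predicate rdf:resource="http://purl.org/dc/elements/1.1/title"/>
        <rdf:object>New Title</rdf:object>
      </rdf:Statement>
    </cs:addition>
  </cs:ChangeSet>
</rdf:RDF>"""

mirror = {(THING, TITLE, "Original Title")}
mirror = apply_changeset(mirror, CHANGESET)
```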

