LinkedDataCampVienna2009/DatasetDynamics

From Linked Data Camp

Problem

Data within Linked Data sets (LDS) changes over time. The frequency of changes depends on the nature of the LDS; publication databases, for instance, are likely to change less frequently than an LDS exposing weather information. In any case, clients that depend on data in a remote data set need to be notified about changes.

We have identified the following major use cases:

  • UC1: a client mirrors or replicates (parts of) an LDS such as DBpedia and needs to be informed about changes on the triple level, i.e., which triples have been added or removed
  • UC2: a client (an LDS publisher) has link dependencies on another LDS. In order to keep links up to date, it needs to be informed about changes on the resource level, i.e., changes affecting the resource identification or description
  • UC3: a client (an application) uses one or more LDS and needs to know what has changed (a sort of smart cache)

For providing such a notification infrastructure we need the following ingredients:

  1. a vocabulary for representing information about the dynamics of an LDS (Dataset Dynamics Vocabulary)
  2. a vocabulary for expressing WHAT has changed and HOW it has changed, if necessary
  3. a protocol for communicating changes to the client

Proposed Solutions

Dataset Dynamics Vocabulary

Dady is a Dataset Dynamics Discovery (D3) vocabulary, which can represent information about the regularity (regular, irregular) and frequency (no, low, mid, high) of updates and provide a link to the update notification source URI. It is designed to be used with voiD. Here is an example:

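@prefix rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix void: <http://rdfs.org/ns/void#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
# Assumed namespace, derived from the vocabulary URI http://purl.org/NET/dady
@prefix dady: <http://purl.org/NET/dady#> .
# Hypothetical namespace for the example dataset
@prefix :     <http://example.org/ds/#> .

:ds	rdf:type void:Dataset ;
	foaf:homepage <http://example.org/ds/> ;
	dady:dynamics [
		rdf:type dady:RegularUpdates ;
		rdf:type dady:LowFrequentUpdates ;
		dady:update [
			rdf:type dady:AtomUpdateSource ;
			dady:notification <http://example.org/updates.atom> ;
		] ;
	] .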

Triple-level change notification protocol

This approach should work for both use cases. The protocol uses Atom to inform the client that data has been added or removed. There are two feed modes, which a single service could offer side by side. Some open issues remain to be discussed.

Feed parameters: short list or full mode, time period for updates, consolidated or raw changes

Short list

The short list only offers information about the resources that have changed, but does not offer detailed information about the actual changes.

For example:

<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">

  <title>LDC09 demo dataset changes</title>
  <link href="http://localhost:8888/dady-demo/publisher/"/>
  <updated>2009-11-30T14:10:00Z</updated>
  <author>
    <name>Michael Hausenblas</name>
  </author>
  <id>12345</id>

  <entry>
    <title>change 123456780</title>
    <link href="http://example.org/resource/A" rel="data-added" />
    <id>123456780</id>
    <updated>2009-11-30T14:10:00Z</updated>
    <summary>added some new resources and removed some old ones</summary>
  </entry>

  <entry>
    <title>change 1234567891</title>
    <link href="http://example.org/resource/B" rel="data-removed" />
    <id>1234567891</id>
    <updated>2009-11-30T17:35:00Z</updated>
    <summary>removed some data</summary>
  </entry>

</feed>


The link element indicates which resources (http://example.org/resource/A, http://example.org/resource/B) are affected. The 'rel' attribute (data-added / data-removed) indicates the nature of an atomic change, i.e., data has been added or removed.
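A consumer could read the short list roughly as follows. This is only a minimal sketch using Python's standard library: the function name changed_resources is made up for illustration, the feed is a trimmed copy of the example above, and the 'data-added' / 'data-removed' values are the proposed (not registered) link relations.

```python
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"

# Trimmed copy of the short-list feed shown above (XML declaration omitted).
FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <title>LDC09 demo dataset changes</title>
  <entry>
    <title>change 123456780</title>
    <link href="http://example.org/resource/A" rel="data-added" />
    <id>123456780</id>
    <updated>2009-11-30T14:10:00Z</updated>
  </entry>
  <entry>
    <title>change 1234567891</title>
    <link href="http://example.org/resource/B" rel="data-removed" />
    <id>1234567891</id>
    <updated>2009-11-30T17:35:00Z</updated>
  </entry>
</feed>"""


def changed_resources(feed_xml):
    """Return (resource URI, change type, timestamp) for every change link."""
    root = ET.fromstring(feed_xml)
    changes = []
    for entry in root.findall(ATOM + "entry"):
        updated = entry.findtext(ATOM + "updated")
        for link in entry.findall(ATOM + "link"):
            rel = link.get("rel")
            # Ignore links that do not use the proposed change relations.
            if rel in ("data-added", "data-removed"):
                changes.append((link.get("href"), rel, updated))
    return changes


if __name__ == "__main__":
    for uri, rel, updated in changed_resources(FEED):
        print(updated, rel, uri)
```

With the short list the consumer only learns which resources to re-fetch; it still has to dereference them to see what actually changed.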

Full mode

The full mode contains all changes, including the changed triples. Open question: should changes be consolidated (e.g., if a triple has changed several times in the period of interest, only the most recent change is provided)?

For example:

<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom"
      xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">

  <title>LDC09 demo dataset changes</title>
  <link href="http://localhost:8888/dady-demo/publisher/"/>
  <updated>2009-11-30T14:10:00Z</updated>
  <author>
    <name>Michael Hausenblas</name>
  </author>
  <id>12345</id>

  <entry>
    <title>change 123456780</title>
    <link href="http://example.org/resource/A" rel="data-added" />
    <id>123456780</id>
    <updated>2009-11-30T14:10:00Z</updated>
    <rdf:Statement>
      <rdf:subject rdf:resource="http://example.com/res#thing"/>
      <rdf:predicate rdf:resource="http://purl.org/dc/elements/1.1/title"/>
      <rdf:object>New Title</rdf:object>
    </rdf:Statement>
    <rdf:Statement>
      <rdf:subject rdf:resource="http://example.com/res2#thing"/>
      <rdf:predicate rdf:resource="http://purl.org/dc/elements/1.1/title"/>
      <rdf:object>New Title for res2</rdf:object>
    </rdf:Statement>
    <summary>added some new resources and removed some old ones</summary>
  </entry>

  <entry>
    <title>change 1234567891</title>
    <link href="http://example.org/resource/B" rel="data-removed" />
    <id>1234567891</id>
    <updated>2009-11-30T17:35:00Z</updated>
    <rdf:Statement>
      <rdf:subject rdf:resource="http://example.com/res#thing"/>
      <rdf:predicate rdf:resource="http://purl.org/dc/elements/1.1/title"/>
      <rdf:object>Old Title</rdf:object>
    </rdf:Statement>
    <summary>removed some data</summary>
  </entry>

</feed>


The link element indicates which resources (http://example.org/resource/A, http://example.org/resource/B) are affected. The 'rel' attribute (data-added / data-removed) indicates the nature of an atomic change, i.e., data has been added or removed. The rdf:Statement elements of an entry carry the affected triples.
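For the mirroring use case (UC1), a client could turn a full-mode feed into triple additions and removals and apply them to its local copy. The following is a rough sketch under several assumptions: the function names are invented, the mirror is just a Python set of (subject, predicate, object) tuples, literals and URIs in object position are not distinguished, and consolidate shows only one possible reading of "consolidated changes" (keep the last operation per triple).

```python
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"
RDF = "{http://www.w3.org/1999/02/22-rdf-syntax-ns#}"

# Trimmed copy of the full-mode feed, with the rdf prefix declared.
FEED = """<feed xmlns="http://www.w3.org/2005/Atom"
      xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <entry>
    <link href="http://example.org/resource/A" rel="data-added" />
    <updated>2009-11-30T14:10:00Z</updated>
    <rdf:Statement>
      <rdf:subject rdf:resource="http://example.com/res#thing"/>
      <rdf:predicate rdf:resource="http://purl.org/dc/elements/1.1/title"/>
      <rdf:object>New Title</rdf:object>
    </rdf:Statement>
  </entry>
  <entry>
    <link href="http://example.org/resource/B" rel="data-removed" />
    <updated>2009-11-30T17:35:00Z</updated>
    <rdf:Statement>
      <rdf:subject rdf:resource="http://example.com/res#thing"/>
      <rdf:predicate rdf:resource="http://purl.org/dc/elements/1.1/title"/>
      <rdf:object>Old Title</rdf:object>
    </rdf:Statement>
  </entry>
</feed>"""


def parse_changes(feed_xml):
    """Return a list of (rel, (subject, predicate, object)) in feed order."""
    changes = []
    for entry in ET.fromstring(feed_xml).findall(ATOM + "entry"):
        rel = entry.find(ATOM + "link").get("rel")
        for st in entry.findall(RDF + "Statement"):
            s = st.find(RDF + "subject").get(RDF + "resource")
            p = st.find(RDF + "predicate").get(RDF + "resource")
            obj = st.find(RDF + "object")
            # The object is either a resource reference or a plain literal.
            o = obj.get(RDF + "resource") or obj.text
            changes.append((rel, (s, p, o)))
    return changes


def consolidate(changes):
    """Keep only the most recent operation for each triple."""
    latest = {}
    for rel, triple in changes:
        latest.pop(triple, None)  # re-insert so the order follows the last change
        latest[triple] = rel
    return [(rel, triple) for triple, rel in latest.items()]


def apply_changes(triples, changes):
    """Apply the changes in order and return the updated set of triples."""
    result = set(triples)
    for rel, triple in changes:
        if rel == "data-added":
            result.add(triple)
        elif rel == "data-removed":
            result.discard(triple)
    return result
```

A mirror that initially holds the "Old Title" triple would end up holding only the "New Title" triple after apply_changes(mirror, parse_changes(FEED)).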

Summary / Conclusions Day 1

Common agreement: there is a need to inform clients about resource changes. This problem involves several dimensions:

  1. information on the dynamics of a data source (e.g., how frequently does it change)
  2. how to model dynamics / changes in data sets, and which vocabularies to use in the end (some already exist)
  3. how to transport change information to the client (notification/subscription, etc.)

ad 1.) makes sense if you use a notification approach rather than a subscription approach, because you then need to know how frequently to check a source for updates. Dady, as an extension of voiD, is a possible solution.
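To illustrate the point about check frequency: a polling client could translate the dady frequency class of a dataset into a polling interval. The sketch below is purely illustrative; dady only names the frequency classes, so the concrete intervals, the default, and the function name poll_interval are assumptions.

```python
from datetime import timedelta

# Hypothetical mapping: dady defines the classes, not concrete intervals.
POLL_INTERVALS = {
    "dady:NoUpdates": None,  # static dataset, no polling needed
    "dady:LowFrequentUpdates": timedelta(days=7),
    "dady:MidFrequentUpdates": timedelta(days=1),
    "dady:HighFrequentUpdates": timedelta(minutes=15),
}


def poll_interval(dynamics_types, default=timedelta(days=1)):
    """Pick a polling interval from the rdf:type values of a dady:dynamics node.

    Returns None when the dataset declares that it is never updated, and the
    default when no frequency class is present.
    """
    for type_name in dynamics_types:
        if type_name in POLL_INTERVALS:
            return POLL_INTERVALS[type_name]
    return default


if __name__ == "__main__":
    # Types taken from the voiD+dady example: regular, low-frequency updates.
    print(poll_interval(["dady:RegularUpdates", "dady:LowFrequentUpdates"]))
```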

ad 2.) the question is whether change notification happens on a low (triple) level or on a higher (event) level. The problem with the triple level is how to identify a triple externally: if there is no way to address triples, you have to transport the complete triples / triple diffs. A higher-level model would provide detailed information on WHAT has changed in a resource and HOW (e.g., removed, moved from A to B, split, merged, etc.).

ad 3.) the idea was to provide both notification (PULL) and subscription (PUSH). Notification would use well-known feed protocols (RSS/Atom) provided by the RDF store / data set provider, and may be limited to a certain time frame and certain filter criteria (URI filters). For push, existing protocols still need to be investigated.
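The time-frame and URI-filter restriction mentioned for pull notification could look roughly like this on the provider side. This is a sketch under assumptions: changes are represented as (resource URI, change type, timestamp) tuples, the URI filter is a simple prefix match, and the names filter_changes, since and uri_prefix are invented for illustration.

```python
from datetime import datetime, timezone


def parse_ts(ts):
    """Parse timestamps of the form used in the feeds, e.g. 2009-11-30T14:10:00Z."""
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def filter_changes(changes, since=None, uri_prefix=None):
    """Keep changes at or after `since` whose resource URI starts with `uri_prefix`."""
    selected = []
    for uri, rel, updated in changes:
        if since is not None and parse_ts(updated) < since:
            continue
        if uri_prefix is not None and not uri.startswith(uri_prefix):
            continue
        selected.append((uri, rel, updated))
    return selected


if __name__ == "__main__":
    changes = [
        ("http://example.org/resource/A", "data-added", "2009-11-30T14:10:00Z"),
        ("http://example.org/resource/B", "data-removed", "2009-11-30T17:35:00Z"),
    ]
    cutoff = datetime(2009, 11, 30, 15, 0, tzinfo=timezone.utc)
    print(filter_changes(changes, since=cutoff))
```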


TODOs for Day 2:

  • set up some basic real-world use cases (e.g., taken from Wikipedia/DBpedia)
  • see how existing vocabularies / protocols can fulfill these requirements
  • define a vocabulary / protocol draft based on existing solutions


Simple Demo (by LiDRC)

Goal: develop a simple demo covering two aspects:

  • discovery (how does the consumer learn about the change/update mechanism) ... voiD+dady
  • notification (how does the consumer learn about changes) ... extended Atom


Dady voc

Documentation

see http://code.google.com/p/dady/wiki/Demos

Setup

Publisher

  • Juergen - dataset change manager (DCM)
    • input: dataset URI, poll interval
    • output: Atom feed with changes (link rel='data-added', link rel='data-removed')
  • Michael - voiD+dady
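The dataset change manager could, in principle, work by comparing two snapshots of the dataset taken one poll interval apart and writing the differences into an Atom feed. The following is only a sketch of that idea, not the demo implementation: snapshots are assumed to be Python sets of (subject, predicate, object) tuples, the function names are made up, and required Atom metadata such as id and author is left out for brevity.

```python
import xml.etree.ElementTree as ET

ATOM_NS = "http://www.w3.org/2005/Atom"


def diff_snapshots(old, new):
    """Return (added, removed) triples between two snapshots (sets of tuples)."""
    return new - old, old - new


def build_feed(added, removed, updated, title="dataset changes"):
    """Serialize changes as a short-list Atom feed, one entry per subject."""
    ET.register_namespace("", ATOM_NS)  # emit Atom as the default namespace
    feed = ET.Element("{%s}feed" % ATOM_NS)
    ET.SubElement(feed, "{%s}title" % ATOM_NS).text = title
    ET.SubElement(feed, "{%s}updated" % ATOM_NS).text = updated
    for rel, triples in (("data-added", added), ("data-removed", removed)):
        # Report each affected subject once per change type.
        for subject in sorted({s for s, _, _ in triples}):
            entry = ET.SubElement(feed, "{%s}entry" % ATOM_NS)
            ET.SubElement(entry, "{%s}title" % ATOM_NS).text = "%s %s" % (rel, subject)
            ET.SubElement(entry, "{%s}link" % ATOM_NS, href=subject, rel=rel)
            ET.SubElement(entry, "{%s}updated" % ATOM_NS).text = updated
    return ET.tostring(feed, encoding="unicode")


if __name__ == "__main__":
    old = {("http://example.org/resource/B", "http://purl.org/dc/elements/1.1/title", "Old Title")}
    new = {("http://example.org/resource/A", "http://purl.org/dc/elements/1.1/title", "New Title")}
    added, removed = diff_snapshots(old, new)
    print(build_feed(added, removed, "2009-11-30T14:10:00Z"))
```

Comparing whole snapshots only scales to small datasets; a store that logs its own modifications could fill the feed directly.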

Consumer

  • Michael - dataset change watch-dog (DCW)
    • input: voiD URI
    • output: live report on changes (HTML/jQuery)


Discussion Day 1

Starting point: http://esw.w3.org/topic/DatasetDynamics


Involved in the discussion:

  • A. Langegger
  • R. Cyganiak
  • N. Popitsch
  • L. Dodds
  • B. Haslhofer
  • W. Halb
  • C. Ruiz
  • D. Koller (@dakoller)


A major point that was brought up was that we need a model of what can actually happen (events?!) to an identifier. So basically this could be a kind of lifecycle model of URIs (was also discussed to some extent on the mailing list as "URI history"). Some possible events that can happen to URIs:

  • create
  • remove (?)
  • "move" (i.e., "same" data published under a different URI)
  • split/merge

A good starting point could be to further analyze what happens in DBpedia when Wikipedia is updated (analysis already started in the context of DSNotify): pages are renamed, split, disambiguation pages, redirects, etc.

  • How to describe the dynamics of datasets?
    • A first analysis of DBpedia showed that many changes of various kinds take place. Many Wikipedia articles are created/removed/renamed. The update frequencies of articles (representations of DBpedia URIs) also vary a lot: in some months there are only a few changes, then suddenly many modifications by many different users.
    • Some dataset providers have these data available, most do not.
  • Do we need new vocabularies to describe datasets in general or should we extend existing ones?
    • We need a list of all vocabs that can be used to describe dataset properties (like voiD, SCOVO, Talis Changesets, DSNotify eventsets, dady, ...) as started at [1]
  • Is the "event" approach the right one?
    • Events that occur in a dataset are detected by some actor and communicated to subscribers
      • there are multiple well-known models for doing this. It may be too early to decide which way to go
    • When you know what events happened in your dataset then you can also describe its dynamics
  • What granularity?
    • The simplest useful event could be that a dataset was updated, so the subscribers can react
    • We could define these on a document level (has a particular RDF document changed, was it removed, is it published under a different URI? ...)
    • We could log events based on triple modifications
  • How could one represent such URI lifecycles?
    • Are version control systems applicable to this problem? Could you store your URIs in an SVN server?


  • Should notification be implemented on a triple store level? Can a protocol then send out notifications on a triple level?
    • Problem: triples do not have external identifiers.

Vocabularies and protocols

Dady

URI: http://purl.org/NET/dady

Classes:

dady:UpdateDynamics
(subclasses)
   dady:UpdateRegularity
   dady:IrregularUpdates
   dady:RegularUpdates

dady:UpdateFrequency
(subclasses)
   dady:NoUpdates
   dady:LowFrequentUpdates
   dady:MidFrequentUpdates
   dady:HighFrequentUpdates

dady:UpdateSource
(subclasses)
   dady:AtomUpdateSource
   dady:ChangeSetUpdateSource
   dady:TriplifyUpdateSource

Attributes:

dady:dynamics	a rdf:Property ;
	rdfs:label "has dynamics" ;
	rdfs:domain void:Dataset ;
	rdfs:range dady:UpdateDynamics .

dady:update	a rdf:Property ;
	rdfs:label "has update source" ;
	rdfs:domain dady:Dynamics ;
	rdfs:range dady:UpdateSource .

Protocol: Atom

Triplify

URI: http://triplify.org/vocabulary/update

Classes:

update:UpdateCollection
update:Update
update:Deletion

Attributes:

update:updatedResource
update:updatedAt
update:updatedBy

Talis Changeset

URIs: http://vocab.org/changeset/schema.html, http://n2.talis.com/wiki/Changesets

Classes (example changeset):

<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:cs="http://purl.org/vocab/changeset/schema#">
  <cs:ChangeSet rdf:about="http://example.com/changesets#change2">
    <cs:precedingChangeSet rdf:resource="http://example.com/res#change1"/>
    <cs:subjectOfChange rdf:resource="http://example.com/res#thing"/>
    <cs:createdDate>2006-01-01T00:00:00Z</cs:createdDate>
    <cs:creatorName>Anne Onymous</cs:creatorName>
    <cs:changeReason>Change of title</cs:changeReason>
    <cs:removal>
      <rdf:Statement>
        <rdf:subject rdf:resource="http://example.com/res#thing"/>
        <rdf:predicate rdf:resource="http://purl.org/dc/elements/1.1/title"/>
        <rdf:object>Original Title</rdf:object>
      </rdf:Statement>
    </cs:removal>
    <cs:addition>
      <rdf:Statement>
        <rdf:subject rdf:resource="http://example.com/res#thing"/>
        <rdf:predicate rdf:resource="http://purl.org/dc/elements/1.1/title"/>
        <rdf:object>New Title</rdf:object>
      </rdf:Statement>
    </cs:addition>
  </cs:ChangeSet>
</rdf:RDF>

Protocol: The Talis Platform supports modification of graphs using the Changeset Protocol, which uses HTTP to convey Changesets.