Friday, March 28, 2014

Ontology Summit 2014 Hackathon, Mar 29 - Ontological Catalog Project

The Ontology Summit 2014 Hackathon is tomorrow (Saturday, Mar 29). There are six proposals, all of which look really interesting! You can see a summary of each of them on the Hackathon wiki page.

As for me, I will be participating in the one that is listed last :-), with Amanda Vizedom. But, I have to warn you that the name is a bit overwhelming ... "An ontological catalogue of ontology and metadata vocabulary characteristics relevant to suitability for semantic web and big data applications". Whew!

Really, it is about creating simple sets of concepts (i.e., a "catalog") to characterize vocabularies, models and ontologies. The catalog will be created in GitHub as a publicly available, open-source ontology (collaboratively developed and extensible over time). This Hackathon project is all about the problem of reuse ... You can't reuse something unless you understand not only what its contents are, but the intent behind creating the thing, how it was designed to be used, etc. This is summarized in the following quote from the Summit's Track A synthesis:
"Documentation must include the basic details of the semantics, but also the range of conditions, contexts and intended purposes for which the content was developed. It was recommended that standard metadata for reuse be defined and complete exemplars provided."

If you are interested in participating, the kickoff is at 10am EDT tomorrow on Google Hangout. Working materials are available at https://github.com/ajvizedom/vocref. To get involved, have a look at the GitHub site, "watch" the repository, and become a contributor by adding yourself to the team roster on the project's wiki page.

Looking forward to working with you tomorrow!

Andrea

Wednesday, March 12, 2014

Secure Collaboration and Intel's Reliance Point

Intel posted a blog article in mid-February about a research project that they call "Reliance Point". Based on the article, it appears that they are working on ways to selectively share data (addressing privacy and IP rights concerns), and to provide integrity and isolation for that data. Intel refers to Reliance Point as a "trustworthy execution environment".

The environment is interesting in that it will bring together data from multiple providers, and allow the providers to perform calculations over the complete set. The providers have to agree on the algorithm that will be used to do the calculations and trust that the infrastructure will protect their data (and not allow other uses or algorithms to be executed, or the data to be revealed).
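To make that arrangement concrete, here is a toy sketch in Python. It is not Intel's design (the article gives no implementation details). It only illustrates the agreement described above: data is pooled, but an algorithm runs only if every provider has approved it, and only the result leaves the environment. The class, the provider names and the algorithm registry are all hypothetical.

```python
# Toy illustration only; every name here is hypothetical, not from Reliance Point.

# The fixed set of algorithms the environment knows how to run.
ALGORITHMS = {
    "total": sum,
    "maximum": max,
}


class SharedEnvironment:
    def __init__(self):
        self._data = {}       # provider -> values; never returned to callers
        self._approvals = {}  # provider -> names of algorithms it has approved

    def contribute(self, provider, values, approved):
        """A provider adds its data and states which algorithms it accepts."""
        self._data[provider] = list(values)
        self._approvals[provider] = set(approved)

    def run(self, name):
        """Run a registered algorithm over the pooled data, if all providers agree."""
        if name not in ALGORITHMS:
            raise KeyError(f"unknown algorithm: {name}")
        if not self._data:
            raise ValueError("no data has been contributed")
        if not all(name in approved for approved in self._approvals.values()):
            raise PermissionError(f"not every provider approved '{name}'")
        pooled = [v for values in self._data.values() for v in values]
        return ALGORITHMS[name](pooled)  # only the result is revealed


env = SharedEnvironment()
env.contribute("provider_a", [10, 20], approved={"total", "maximum"})
env.contribute("provider_b", [30], approved={"total"})

pooled_total = env.run("total")  # allowed: both providers approved it
```

Calling env.run("maximum") here raises PermissionError, because provider_b never agreed to that algorithm. Note that the sketch quietly assumes both providers' numbers mean the same thing.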

"Letting Data Breathe" is the name of the blog post. That title seems a bit exaggerated to me. For data to "breathe" (i.e., be integrated from multiple providers), there must be some standard set of semantics and structure that is supported by the providers, or there must be a way to map between the syntax and (more importantly) the semantics of the different providers. Otherwise, what do calculations mean when run against data with unknown structure, and/or unknown and disparate semantics?

There is no mention of data integration in the article, just trustworthy data availability and negotiated algorithms. But, it seems to me that the project will not work if the problem of semantics is left to the data providers to solve out-of-band. In particular, how does one provider obtain the semantics of another's available data? How is this revealed while still protecting the IP rights of the provider? If proprietary data is shared, then it is likely proprietary all the way down to the layout and syntax of the data (perhaps defined by SQL). But, I have known companies that are reluctant to share even partial db structures since that information may reveal data or IP details.

To make Reliance Point work, something along the lines of OWL and RDF is needed - a way to specify semantics (OWL + SWRL/RIF, which I will discuss in another post) along with a way to handle multiple schemas (RDF). RDF defines a subject-predicate-object structure for data, which is very flexible. Relational databases, for example, can be translated into it. OWL and SWRL/RIF let you define equivalences, logical statements, disjointness and more, which are necessary to actually (semantically) integrate the data.
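As a rough illustration of why the equivalences matter, here is a minimal Python sketch. It uses plain tuples rather than a real RDF library, and a dictionary stands in for an OWL equivalent-property assertion. The provider prefixes, predicate names and values are all made up for the example.

```python
# Toy sketch, not real RDF/OWL tooling; all identifiers and values are hypothetical.
from statistics import mean

# Each fact is a (subject, predicate, object) triple, as in RDF.
provider_a = [
    ("a:patient1", "a:ageInYears", 34),
    ("a:patient2", "a:ageInYears", 58),
]
provider_b = [
    ("b:p-17", "b:age", 41),
]

# Stand-in for an owl:equivalentProperty statement:
# b:age means the same thing as a:ageInYears.
equivalent_property = {"b:age": "a:ageInYears"}


def normalize(triples, equivalences):
    """Rewrite each predicate to its canonical equivalent, if one is declared."""
    return [(s, equivalences.get(p, p), o) for s, p, o in triples]


# Without the mapping, a query on a:ageInYears silently misses provider B.
unmapped = provider_a + provider_b
ages_unmapped = [o for _, p, o in unmapped if p == "a:ageInYears"]

# With the mapping, the same query sees both providers' data.
merged = normalize(provider_a, equivalent_property) + normalize(provider_b, equivalent_property)
ages = [o for _, p, o in merged if p == "a:ageInYears"]
average_age = mean(ages)
```

The calculation itself is trivial. What decides whether the answer means anything is the one-line equivalence, and somebody has to author it and share it without exposing more of the schema than the provider is comfortable with.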

In theory, Reliance Point seems good, but Intel is working on the easier part of the problem (the infrastructure) and not the deeper problems that will prevent usage (integrating the data).

Andrea