Monday, January 25, 2016

Ontologies for Reuse Versus Integration

There is an ongoing email discussion in preparation for this year's Ontology Summit on "semantic integration". I thought that I would share one of my recent posts to that discussion here, on my blog. The issue is reuse versus integration ...

For me, designing for general reuse is a valid and valuable goal (if you have the time, which is not always the case). (It was also the subject of the Summit two years ago and of many of my posts from that time, March-May 2014!) But reusing an ontology or design pattern in multiple places is not semantic integration. Reuse and integration are different beasts, although they are complementary.

I have designed ontologies for both uses (reuse and integration), but my approach to the two is different. Designing for reuse is usually focused on a small domain that is well understood. There are general problem areas (such as creating ontologies/design patterns for events, or to support Allen's time interval algebra) that are broadly applicable. In these areas, general design and reuse make sense.

Over the years, however, I have been much more focused on designing for integration (especially in the commercial space). In my experience, companies are always trying to combine different systems - whether these systems are legacy vs new, systems that come into the mix due to acquisition, internal (company-centric) vs external (customer-driven), dictated by the problem space (combining systems from different vendors or different parts of an organization to solve a business problem), ...

It is OK to try to be forward-thinking in designing these integration ontologies ... anticipating areas of integration. But I have probably been wrong in my guesses (of what was needed in the "future" ontology) more often than I have been right - unless the guess was indeed in a general problem domain.

So, my integration "rules of thumb" are:
  • Get the SMEs in a particular domain to define the problem space and their solution (don't ever ask the SMEs about integrating their domains)
  • Don't ever favor one domain over another when shaping the ontology (the result is sure not to be future-proof)
  • Focus on the biggest problem areas first, and find the commonalities/general concepts (superclasses)
  • Place the domain details "under" these superclasses
  • Never try to change the vocabulary of a domain, just map to/from the domains to the "integration" ontology
  • Never map everything in a domain, just what needs to be integrated
  • Look for smaller areas of "general patterns" that can be broadly reused
  • Have new work start from the integrating ontology instead of creating a totally new model
  • Update the integrating ontology based on mapping problems and new work (never claim that the ontology is immutable)
  • Utilize OWL's equivalentClass/disjointWith/intersectionOf/unionOf/... (for classes), sameAs/differentFrom (for individuals) and class and property restrictions to tie concepts together in the "mapped" space
  • Be focused on concept diagrams and descriptions, documenting mapping details, ... and not that you are using an ontology
  • Clearly document ontology/concept, relationship, ... evolution
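
As a rough illustration of the "map, don't rename" rules above, here is a minimal Python sketch. It is not a real OWL reasoner or the API of any triple store. The prefixes `crm:`, `billing:` and `int:`, all class and individual names, and the helpers `superclasses` and `instances_of` are hypothetical and chosen only for this example. It treats mappings as plain triples and follows `rdfs:subClassOf` and `owl:equivalentClass` links to answer a question posed in the integration vocabulary.

```python
# Hypothetical names throughout: "crm:" and "billing:" stand for two domain
# systems, "int:" for the integration ontology. This is a sketch, not a reasoner.
SUBCLASS_OF = "rdfs:subClassOf"
EQUIVALENT_CLASS = "owl:equivalentClass"
TYPE = "rdf:type"

# Mappings from each domain's own vocabulary to the integration ontology.
# The domain terms are left untouched; only what needs integrating is mapped.
mappings = {
    ("crm:Client", EQUIVALENT_CLASS, "int:Customer"),
    ("billing:AccountHolder", SUBCLASS_OF, "int:Customer"),
    ("int:Customer", SUBCLASS_OF, "int:Party"),  # general concept as superclass
}

# Instance data stays in each domain's native terms.
data = {
    ("crm:c42", TYPE, "crm:Client"),
    ("billing:a7", TYPE, "billing:AccountHolder"),
}


def superclasses(cls, axioms):
    """Return cls plus every class reachable via subClassOf/equivalentClass."""
    seen = {cls}
    frontier = [cls]
    while frontier:
        current = frontier.pop()
        for s, p, o in axioms:
            # subClassOf is followed upward; equivalentClass in both directions
            if p == SUBCLASS_OF and s == current:
                nxt = o
            elif p == EQUIVALENT_CLASS and s == current:
                nxt = o
            elif p == EQUIVALENT_CLASS and o == current:
                nxt = s
            else:
                continue
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return seen


def instances_of(cls, facts, axioms):
    """Individuals whose asserted type is cls or is mapped under cls."""
    return {s for s, p, o in facts
            if p == TYPE and cls in superclasses(o, axioms)}


customers = instances_of("int:Customer", data, mappings)
print(sorted(customers))
```

Asking for `int:Customer` returns individuals from both systems even though neither system uses that term, which is the point of keeping the domain vocabularies intact and doing the work in the mappings.
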
Let me know if this resonates with you or if you have different "rules of thumb".

Andrea

4 comments:

  1. Efficiency / OWL Profile note:
    Using the union class constructors can make reasoning over OWL ontologies quite a bit slower. It may also prevent the use of some reasoners, or disable important optimizations.

    For many purposes you can replace the union with a new class that is a superclass of the various choices. Reasoning results will not be complete, but may include the desired inferences.

    If we define A = (B or C), and we know that x is an A and x is not a B, we can infer that x is a C.
    If we only have B subClassOf A and C subClassOf A, we cannot make this inference.
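
    The difference between the two modellings can be checked with a small Python sketch. It is a propositional stand-in for a single individual x, not a description logic reasoner: it enumerates every combination of memberships of x in A, B and C and tests whether the conclusion holds in all combinations allowed by the axioms and premises. The function and variable names are made up for this illustration.

```python
from itertools import product


def entails(axioms, premises, conclusion):
    """True if every membership assignment for x that satisfies the axioms
    and the premises also satisfies the conclusion."""
    for a, b, c in product([False, True], repeat=3):
        world = {"A": a, "B": b, "C": c}  # is x in A, in B, in C?
        if all(ax(world) for ax in axioms) and all(p(world) for p in premises):
            if not conclusion(world):
                return False  # found a counter-model
    return True


# A is defined as the union of B and C: A = (B or C)
union_definition = [lambda w: w["A"] == (w["B"] or w["C"])]

# Approximation: only B subClassOf A and C subClassOf A
subclass_only = [
    lambda w: (not w["B"]) or w["A"],
    lambda w: (not w["C"]) or w["A"],
]

# We know that x is an A and x is not a B
premises = [lambda w: w["A"], lambda w: not w["B"]]


def x_is_c(w):
    return w["C"]


with_union = entails(union_definition, premises, x_is_c)
with_subclasses = entails(subclass_only, premises, x_is_c)
print(with_union, with_subclasses)
```

    With the union definition the conclusion follows. With only the two subclass axioms, the assignment where x is in A but in neither B nor C is a counter-model, so the inference is lost.
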

    Replies
    1. Simon, yes, that is a good point. I would add that "your mileage may vary" and that whether you make this trade-off is an implementation detail - versus a definitional detail.

      In my case, I use the Stardog triple store (with an embedded Pellet reasoner) and have not had performance issues with reasoning over these kinds of queries, but I do other optimizations - such as limiting the data over which reasoning is performed by using named graphs.

    2. Stardog Pellet may be able to avoid performance problems, but the wrong constructs can prevent it from exhibiting performance anti-problems (see http://docs.stardog.com/#_performance hints ).

      If you are working with very large t-boxes (such as most of the main bio-sciences), keeping within EL is a good thing.

      It is still worthwhile using unionOf constructors if they are more precise models of the application domain; they can be trivially transformed into one-way inferences using the owlapi. The same thing applies to using full DL, or DL + rules, though the approximations become rougher.

      Conversely, it is useful to make performance-at-scale part of your nightly test suite (AJV's non-owl example of "Sorry to keep you waiting- the thing that killed you yesterday was an air assault. Is this correct? Hello?") This is the sort of integration test for integrating ontologies that can catch problems before it's too late to be selective.

      The difference between profiles can be most significant for the top performing reasoners (which aren't as tied into database integration).

      There doesn't seem to be a tool like pellet lint for the new Pellet.

    3. I agree with your insights regarding actual performance and the need to test performance in your nightly builds! Very valid! I should probably do some blogging on that topic since reality interferes a lot with theory. :-)

      Another more mundane version of the "implementation" problem is the complexity/power of the OWL Time ontology vs storing an xsd:dateTimeStamp (or begin/end timestamps) in a triple store. There are huge differences between the two approaches in expressivity and performance.
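
      To make the begin/end-timestamp side of that trade-off concrete, here is a minimal Python sketch that derives the Allen interval relation between two intervals from plain timestamps. It is only an illustration under stated assumptions: the function name `allen_relation` and the relation labels are chosen for this example, intervals are assumed to be proper (start strictly before end), and nothing here uses OWL Time or a triple store.

```python
from datetime import datetime


def allen_relation(a_start, a_end, b_start, b_end):
    """Name the Allen relation of interval A to interval B.

    Assumes proper intervals (start strictly before end).
    """
    if a_start >= a_end or b_start >= b_end:
        raise ValueError("each interval must start before it ends")
    if a_end < b_start:
        return "before"
    if a_end == b_start:
        return "meets"
    if b_end < a_start:
        return "after"
    if b_end == a_start:
        return "met-by"
    # From here on the two intervals share more than a single point.
    if a_start == b_start and a_end == b_end:
        return "equals"
    if a_start == b_start:
        return "starts" if a_end < b_end else "started-by"
    if a_end == b_end:
        return "finishes" if a_start > b_start else "finished-by"
    if b_start < a_start and a_end < b_end:
        return "during"
    if a_start < b_start and b_end < a_end:
        return "contains"
    return "overlaps" if a_start < b_start else "overlapped-by"


def at(hour):
    """Hypothetical helper: a timestamp on an arbitrary example day."""
    return datetime(2016, 1, 25, hour)


print(allen_relation(at(9), at(10), at(10), at(11)))
print(allen_relation(at(9), at(11), at(10), at(12)))
print(allen_relation(at(10), at(11), at(9), at(12)))
```

      The comparison logic is cheap, but it lives in application code rather than in the model, which is the expressivity side of the trade-off.
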

      But, in my post above, I was trying to convey semantics/model definitions as opposed to performance tuning. (Kind of like the differences between conceptual and logical models and physical models for relational dbs.)

      Also, I have found that the EL restrictions are sometimes too restrictive for my needs - and then I am forced to live with less-than-optimal performance.

      All the world is a trade-off...
