Monday, January 25, 2016

Ontologies for Reuse Versus Integration

There is ongoing email discussion in preparation for this year's Ontology Summit on "semantic integration". I thought that I would share one of my recent posts to that discussion here, on my blog. The issue is reuse versus integration ...

For me, designing for general reuse is a valid goal and valuable (if you have the time, which is not always true). (Also, it was the subject of the Summit 2 years ago and many of my posts from that time - March-May 2014!) But reusing an ontology or design pattern in multiple places is not semantic integration. Reuse and integration are different beasts, although they are complementary.

I have designed ontologies for both uses (reuse and integration), but my approach to the two is different. Designing for reuse is usually focused on a small domain that is well understood. There are general problem areas (such as creating ontologies/design patterns for events, or to support Allen's time interval algebra) that are generally applicable. In these areas, general design and reuse makes sense.

Over the years, however, I have been much more focused on designing for integration (especially in the commercial space). In my experience, companies are always trying to combine different systems together - whether these systems are legacy vs new, systems that come into the mix due to acquisition, internal (company-centric) vs external (customer-driven), dictated by the problem space (combining systems from different vendors or different parts of an organization to solve a business problem), ...

It is ok to try to be forward-thinking in designing these integration ontologies ... anticipating areas of integration. But, I have been wrong in my guesses (of what was needed in the "future" ontology) probably more than I have been right - unless it was indeed in general problem domains.

So, my integration "rules of thumb" are:
  • Get the SMEs in a particular domain to define the problem space and their solution (don't ever ask the SMEs about integrating their domains)
  • Don't ever give favor to one domain over another in influencing the ontology (you are sure to not be future-proof)
  • Focus on the biggest problem areas first, and find the commonalities/general concepts (superclasses)
  • Place the domain details "under" these superclasses
  • Never try to change the vocabulary of a domain, just map to/from the domains to the "integration" ontology
  • Never map everything in a domain, just what needs to be integrated
  • Look for smaller areas of "general patterns" that can be broadly reused
  • Have new work start from the integrating ontology instead of creating a totally new model
  • Update the integrating ontology based on mapping problems and new work (never claim that the ontology is immutable)
  • Utilize OWL's equivalentClass/disjointWith/intersectionOf/unionOf/... (for classes), sameAs/differentFrom (for individuals) and class and property restrictions to tie concepts together in the "mapped" space
  • Be focused on concept diagrams and descriptions, documenting mapping details, ... and not that you are using an ontology
  • Clearly document ontology/concept, relationship, ... evolution
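As a rough illustration of the "map, don't rename" rules above, here is a minimal Python sketch. Everything in it is an assumption made for the example: the `hr:`, `crm:` and `int:` prefixes, the class names, and the one-hop `integrated_types` helper. It is not a reasoner, only a picture of how mapping axioms can sit in the integration space while the domain vocabularies stay untouched.

```python
# Hypothetical vocabularies: "hr:" and "crm:" are two domains,
# "int:" is the integration ontology. None of these come from a real system.
OWL_EQUIVALENT_CLASS = "owl:equivalentClass"
RDFS_SUBCLASS_OF = "rdfs:subClassOf"
RDF_TYPE = "rdf:type"

# Mapping axioms live in the "mapped" space; the domain terms are not renamed.
mappings = [
    ("hr:Employee", RDFS_SUBCLASS_OF, "int:Person"),
    ("crm:Contact", RDFS_SUBCLASS_OF, "int:Person"),
    ("hr:Dept", OWL_EQUIVALENT_CLASS, "int:OrganizationalUnit"),
]

# Only the domain facts that need to be integrated are included.
domain_data = [
    ("hr:emp42", RDF_TYPE, "hr:Employee"),
    ("crm:c7", RDF_TYPE, "crm:Contact"),
    ("hr:d3", RDF_TYPE, "hr:Dept"),
]


def integrated_types(data, axioms):
    """Return the integration-ontology types implied by the mappings.

    Simplification: axioms are applied one hop, from domain class to
    integration class only (equivalentClass is not applied in reverse).
    """
    lifted = {}
    for cls, prop, target in axioms:
        if prop in (OWL_EQUIVALENT_CLASS, RDFS_SUBCLASS_OF):
            lifted.setdefault(cls, set()).add(target)
    inferred = set()
    for subject, prop, cls in data:
        if prop == RDF_TYPE:
            for target in lifted.get(cls, ()):
                inferred.add((subject, RDF_TYPE, target))
    return inferred


inferred_types = integrated_types(domain_data, mappings)
```

In a real project, the same axioms would be stated in OWL and applied by a reasoner or the triple store, rather than by hand-written code.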
Let me know if this resonates with you or if you have different "rules of thumb".

Andrea

Sunday, January 3, 2016

2016 and continuing posts on ontologies

Well, 2015 seems to have gotten away from me. Over the last year, I have been working to design and implement several ontologies for policy-based management. The work is based on Complexible's Stardog graph database, with services accessed through a RESTful API, and with a front-end, single-page web application created with Bootstrap and Backbone. It has been a blast working with and learning all these technologies, and my new year's resolution is to get back into writing my blog and share some of my learnings.

Another thing that I am doing is related to the International Association for Ontology and Its Applications (IAOA). More specifically, I am a part of the Semantic Web Applied Ontologies Special Interest Group (SWAO SIG). The SIG is continuing the work of the 2014 Ontology Summit and facilitating discussions of ontologies and their development and application. I will be contributing to those discussions in 2016, and started with a short post on the various definitions of the term, ontology. Check out the SWAO link above for the discussion!

That's it for now. Happy 2016!

Andrea

Friday, November 28, 2014

More links to document-related ontologies

Over the course of the last few weeks, a few people have emailed me additional references for document ontologies. Both are valuable links. I want to use them in my solution, and so need to expand the references from my previous post to add these:
  • SALT (Semantically Annotated LaTeX for Scientific Publications) Document Ontology
    • SALT is described in this paper, but unfortunately none of the associated/referenced ontologies are still available
  • FRBRoo
    • A harmonization of the FRBR model (an entity-relationship model from the International Federation of Library Associations and Institutions, published in 1998) and the CIDOC CRM model (ISO Standard 21127, developed by ICOM-CIDOC, the International Council for Museums – International Committee on Documentation)
    • Per the documentation for Version 2 of the model, FRBRoo is "a formal ontology that captures and represents the underlying semantics of bibliographic information and therefore facilitates the integration, mediation, and interchange of bibliographic and museum information"
    • Also, there is an "owlified" version available at https://github.com/erlangen-crm
Given these new insights, I have a bit more work to do on my solution.

Andrea

Monday, November 3, 2014

Document-related ontologies

In my previous post, I defined the competency questions and initial focus for an ontology development effort. The goal of that effort is to answer the question, "What access and handling policies are in effect for a document?"

A relatively (and judging by the length of this post, I do mean "relatively"!) easy place to start is by creating the document-related ontology(ies). (Remember that I am explicitly walking through all the steps in my development process and not just throwing out an answer. At this time, I don't know what the complete answer is!)

Unless your background is content management, or you are a metadata expert, the first step is to learn the basic principles and concepts of the domain being modeled. This helps to define the ontologies and also establishes a base set of knowledge so that you can talk with your customer and their subject-matter experts (SMEs). Never assume that you are the first person to model a domain, or that you inherently know the concepts because they are "obvious". (Unless you invented the domain, you are not the first person to model it. Unless you work in the domain, you don't really know it!)

Beyond just learning about the domain, there are additional advantages for understanding the basic principles and concepts, and looking at previous work ... First, you don't want to waste the experts' time. That is valuable and often limited. The more that you waste an expert's time, the less that they want to talk to you. Second, you need to understand the basics since these are sometimes so obvious to the experts that they consider them "implicit". I.e., they don't say anything about the basics and their assumptions, and eventually you get confused or lost (because you don't have the necessary background or make your own assumptions in real-time, in the conversations). Third, it is valuable to know where mistakes might have been made or where models were created that seem "wrong" to you. Also, it is valuable to know where there are differences of opinion in a domain - and know on which side of a debate your experts land. Understanding boundary cases, and maybe accounting for multiple solutions, may make the difference between your ontology succeeding or failing.

Background knowledge can come from many places. But, I usually start with Google, Bing, Yahoo, etc. (given your personal preference). I type in various phrases and then follow the links. Here are some of the phrases that I started with, for the "documents" space:
  • Dublin Core (since that was specifically mentioned in the competency questions)
  • Document metadata
  • Document management system
  • Document ontology (since there may be a complete ontology ready to adapt or directly reuse)
Clearly this is just a starting list, since each link leads to others. It is valuable to review any Wikipedia links that come up (as they usually provide a level-set). Especially, pay attention to standards. Then dig a bit deeper, looking at academic and business articles, papers and whitepapers. You can do this with a search engine and by checking your company's, or an organization's (such as IEEE and ACM), digital library.

Here is where my initial investigations took me: You can also take a look at the metadata-ontology that I developed from Dublin Core and SKOS, and discussed in earlier posts.

As for the RDF and ontologies, I don't want to take them "as-is" and just put them together. I first want to quickly review them, as well as the ideas from relevant other references (such as OASIS's ODF). Then, we can begin to define a base ontology. It is important to always keep our immediate goals in focus (which are mostly related to document metadata), but also have an idea of probable (or possible) extensions to the ontologies.

When creating my ontologies, I usually (always?) end up taking piece-parts and reusing concepts from multiple sources. The parts can be imported and rationalized via an integrating ontology, or are cut and pasted from the different sources into a new ontology. There are advantages and disadvantages to each approach.

When importing the original ontologies and integrating them (especially when using a tool like Protege), you end up with a large number of classes and properties, with (hopefully) many duplicates or (worst case) many subtle differences in semantics. This can be difficult to manage and sort through, and it takes time to get a good understanding of the individual model/ontology semantics. Another problem with this approach is that the ontologies sometimes evolve. If this happens, URLs may change and your imports could break. Or, you may end up referencing a concept that was renamed or no longer exists. Ideally, when an ontology is published, a link is maintained to the individual versions, but this does not always happen. I usually take the latest version of an ontology or model, and save it to a local directory, maintaining the link to the source and also noting the version (for provenance).
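The "save a local copy and note the version" habit can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `save_snapshot` helper, the `.provenance.json` sidecar file and the example.org URL are all hypothetical, and the ontology content is passed in as bytes so that the sketch does not depend on any particular download method.

```python
import hashlib
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def save_snapshot(content: bytes, source_url: str, version: str, directory: Path) -> Path:
    """Store a local copy of an ontology plus a small provenance record."""
    directory.mkdir(parents=True, exist_ok=True)
    # Use the last URL segment as a file name stem (fallback if there is none).
    stem = source_url.rstrip("/").rsplit("/", 1)[-1] or "ontology"
    target = directory / f"{stem}-{version}.owl"
    target.write_bytes(content)

    record = {
        "source": source_url,  # link back to the original
        "version": version,    # version noted for provenance
        "retrieved": datetime.now(timezone.utc).isoformat(),
        "sha256": hashlib.sha256(content).hexdigest(),  # detects later changes
    }
    sidecar = target.parent / (target.name + ".provenance.json")
    sidecar.write_text(json.dumps(record, indent=2), encoding="utf-8")
    return target


# Usage with made-up content and a made-up source URL.
demo_dir = Path(tempfile.mkdtemp())
saved = save_snapshot(b"<rdf:RDF/>", "http://example.org/onto/docs", "1.0", demo_dir)
provenance = json.loads(
    (saved.parent / (saved.name + ".provenance.json")).read_text(encoding="utf-8")
)
```

The hash is there so that, if the published ontology changes without a new version number, a later comparison against the saved copy will show it.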

Cutting and pasting the various piece parts of different ontologies makes it easier to initially create and control your ontology. The downside is that you sometimes lose the origins and provenance of the piece parts, and/or lose the ability to expand into new areas of the original ontologies. The latter may happen because those ontologies are not "in front" of you ("out of sight, out of mind") or because you have deviated too far from the original semantics and structure.

In my next posts, I will continue to discuss a design for the document-related ontologies (focusing on the immediate needs to reflect the Dublin Core metadata and the existence/location of the documents). In the meantime, let me know if I missed any valuable references, or if you have other ideas for the ontologies.

Andrea

Sunday, October 19, 2014

Breaking Down the "Documents and Policies" Project - Competency Questions

Our previous post defined a project for which a set of ontologies is needed ... "What access and handling policies are in effect for a document?" So, let's just jump into it!

The first step is always to understand the full scope of work and yet to be able to focus your development activities. Define what is needed both initially (to establish your work and ontologies) and ultimately (at the end of the project). Determine how to develop the ontologies, in increments, to reach the "ultimate" solution. Each increment should improve or expand your design, taking care to never go too far in one step (one development cycle). This is really an agile approach and translates to developing, testing, iterating until things are correct, and then expanding. Assume that your initial solutions will need to be improved and reworked as your development activities progress. Don't be afraid to find and correct design errors. But ... Your development should always be driven by detailed use cases and (corresponding) competency questions.

Competency questions were discussed in an earlier post, "General, Reusable Metadata Ontology - V0.2". (They are the questions that your ontology should be able to answer.) Let's assume that you and your customer define the following top-level questions:
  • What documents are in my repositories?
  • What documents are protected or affected by policies?
  • What documents are not protected or affected by policies? (I.e., what are the holes?)
  • What policies are defined?
  • What are the types of those policies (e.g., access or handling/digital rights)?
  • What are the details of a specific policy?
  • Who was the author of a specific policy?
  • List all documents that are protected by multiple access control policies. And, list the policies by document.
  • List all documents that are affected by multiple handling/digital rights policies. And, list the policies by document.
These questions should lead you to ask other questions, trying to determine the boundaries of the complete problem. Remember that it is unlikely that the customers' needs will be addressed in a single set of development activities. (And, work will hopefully expand with your successes!) Often, a customer has deeper (or maybe different) questions that they have not yet begun to define. Asking questions and working with your customer can begin to tease this apart. Even if the customer does not want to go further at this time, it is valuable to understand where and how the ontologies may need to be expanded. Always take care to leave room to expand your ontologies to address new use cases and semantics.
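To make a couple of the competency questions concrete, here is a small Python sketch over made-up data. The document and policy names, the flat `(document, policy, type)` tuples and both helper functions are assumptions for illustration only. In the eventual solution, these questions would be answered by queries against the ontologies in a triple store.

```python
from collections import defaultdict

# Hypothetical sample data: which policy (of which type) covers which document.
documents = {"doc1", "doc2", "doc3"}
assignments = [
    ("doc1", "policyA", "access"),
    ("doc1", "policyB", "access"),
    ("doc2", "policyC", "handling"),
]


def unprotected(docs, assigned):
    """What documents are not protected or affected by policies (the holes)?"""
    covered = {doc for doc, _policy, _kind in assigned}
    return sorted(docs - covered)


def multiply_covered(assigned, kind):
    """List documents covered by multiple policies of one type, with the policies."""
    by_doc = defaultdict(list)
    for doc, policy, policy_kind in assigned:
        if policy_kind == kind:
            by_doc[doc].append(policy)
    return {doc: sorted(found) for doc, found in by_doc.items() if len(found) > 1}


holes = unprotected(documents, assignments)
multi_access = multiply_covered(assignments, "access")
```

Writing the questions down in an executable form like this, even on toy data, is one way to check that each question is precise enough to have a definite answer.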

This brings us back to "General Systems Thinking". It is important to understand a system, its parts and its boundaries.

Here are some follow-on questions (and their answers) that the competency questions could generate:
  • Q: Given that you have document repositories, how are the documents identified and tagged?
    • A: A subset of the Dublin Core information is collected for each document: Author, modified-by, title, creation date, date last modified, keywords, proprietary/non-proprietary flag, and description.
  • Q: How are the documents related to policies?
    • A: Policies apply to documents based on a combination of their metadata.
  • Q: Will we ever care about parts of documents, or do we only care about the documents as a whole?
    • A: We may ultimately want to apply policies to parts of documents, or subset a document based on its contents and provide access to its parts. But, this is a future enhancement.
  • Q: Do policies change over time (for example, becoming obsolete)?
    • A: Yes, we will have to worry about policy evolution and track that.
  • Q: What policy repositories do you have?
    • A: Policies are defined in code and in some specific content management systems. The goal is to collect the details related to all the documents and all the policies in order to guarantee consistency and remove/reduce conflicts.
  • Q: Given the last 2 competency questions, and your goal of removing/reducing conflicts, would you ultimately like the system to find inconsistencies and conflicts? How about making recommendations to correct these?
    • A: Yes! (We will need to dig into this further at a later time in order to define conflicts and remediation schemes.)
Well, we now know more about the ontologies that we will be creating. Initially, we are concerned with document identification/location/metadata and related access and digital rights policies. We can then move onto the provenance and evolution of documents and policies, and understanding conflicts and their remediation.
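The answers above (policies apply based on a combination of metadata, and policies change over time) can be sketched as a simple data model. This is only a sketch under assumptions: the field names loosely follow the metadata listed in the first answer, and the `Policy` conditions, validity dates and sample values are invented for the example. The real design will come from the ontologies developed in later posts.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Document:
    title: str
    author: str
    created: date
    keywords: frozenset
    proprietary: bool


@dataclass(frozen=True)
class Policy:
    name: str
    kind: str  # "access" or "handling"
    requires_proprietary: Optional[bool] = None  # None means "don't care"
    required_keyword: Optional[str] = None
    valid_from: date = date.min   # policy evolution: when it takes effect
    valid_until: date = date.max  # ... and when it becomes obsolete

    def applies_to(self, doc: Document, on: date) -> bool:
        """True if the policy is in effect on the date and all conditions match."""
        if not (self.valid_from <= on <= self.valid_until):
            return False
        if self.requires_proprietary is not None and doc.proprietary != self.requires_proprietary:
            return False
        if self.required_keyword is not None and self.required_keyword not in doc.keywords:
            return False
        return True


# Made-up document and policies.
doc = Document("Q3 forecast", "A. Author", date(2014, 9, 1), frozenset({"finance"}), True)
policies = [
    Policy("restrict-proprietary", "access", requires_proprietary=True),
    Policy("finance-retention", "handling", required_keyword="finance"),
    Policy("old-rule", "access", requires_proprietary=True, valid_until=date(2013, 12, 31)),
]
in_effect = [p.name for p in policies if p.applies_to(doc, on=date(2014, 10, 19))]
```

Here the access and handling policies are evaluated together against one document, and the obsolete rule drops out because of its validity dates.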

So, the next step is to flesh out the details for documents and policies. We will begin to do that in the next post.

Andrea

Monday, October 13, 2014

Understanding semantics and Pinker's "Curse of Knowledge"

I recently read an interesting editorial in the Wall Street Journal from Steven Pinker. It was titled, "The Source of Bad Writing", and discussed something that Pinker called the "Curse of Knowledge".
Curse of Knowledge: a difficulty in imagining what it is like for someone else not to know something that you know
After reading that article, looking at the various posts asking where to find good online courses on semantic technologies and linked data, discussing problems related to finding qualified job candidates, and listening to people (like my husband) who say that I make their heads explode, I decided to talk about semantics differently. Instead of explaining specific aspects of ontologies or semantics, or writing about disconnected aspects of the technologies, I want to go back to basics and explore how and what I do in creating ontologies, what to worry about, how to create, evolve and use an ontology and triple store, ...

Then, I need some feedback from my readers. As Steven Pinker says,
A ... way to exorcise the curse of knowledge is to close the loop, ... and get a feedback signal from the world of readers—that is, show a draft to some people who are similar to your intended audience and find out whether they can follow it. ... The other way to escape the curse of knowledge is to show a draft to yourself, ideally after enough time has passed that the text is no longer familiar. If you are like me you will find yourself thinking, "What did I mean by that?" or "How does this follow?" or, all too often, "Who wrote this crap?"
There are many good papers, books and blog posts on the languages, technologies and standards behind the Semantic Web. (Hopefully, some of my work is there.) I don't want to create yet another tutorial on these, but I do want to talk about creating and using ontologies. So, for the next 6 months or so, my goal is to design and create a set of ontologies through these blog posts - delving into existing ontologies, and semantic languages/standards and tools. In addition, as the ontologies are created, I will discuss using them - which moves us into triple stores and queries.

As we go along, I will reference specs from the W3C, other blog posts and information and tools on the web. My goal is that you can get all of the related specs, tools and details for free. I hope that you will be interested enough to scan or download them (or you might know and use them already), and ask more questions. What is important is to understand the basics, and then we can build from there.

The first question is "What is the subject of the ontology that we will be building and using?" Since I am interested in policy-based management, I would like to develop an ontology and infrastructure to answer the question: "What access and handling policies are in effect for a document?"

At first blush, you might think that the process is relatively easy. Find the document, get its details, find what policies apply, and then follow those policies. But, the policies that apply are possibly dictated by the subject or author of the document, or when it was written (since regulations and company policies change over time). Worse, the access policies are likely defined (and stored) separately from the handling/digital rights policies, but need to be considered together. Lastly, how do we even begin to understand what the policies are saying?

I hope that you see that I did not choose an easy subject at all, but one that will take some time to think through and develop. I am looking forward to doing this and would like your feedback, questions, comments and advice, along the way.

Andrea

Saturday, July 26, 2014

Another OWL diagramming transform and some more thoughts on writing

With summer in full-tilt and lots going on, I seem to have lost track of time and been delinquent in publishing new posts. I want to get back into writing slowly ... with a small post that builds on two of my previous ones.

First, I wrote a new XSL transform that outputs all NamedIndividuals specified in an ontology file. The purpose was to help with diagramming enumerations. (I made a simplifying assumption that you added individuals into a .owl file in order to create enumerated or exemplary individuals.) The location of the transform is GitHub (check out http://purl.org/NinePts/graphing). And, details on how to use the transform (for example, with the graphical editor, yEd) are described in my post, Diagramming an RDF/XML ontology.

If you don't want some individuals included, feel free to refine the transform, or just delete individuals after an initial layout with yEd.
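For readers who would rather not use XSL, the same idea (listing the NamedIndividuals in an RDF/XML file) can be sketched in Python. To be clear, this is not the transform from the GitHub repository. It is a minimal stand-in that assumes individuals are written as explicit owl:NamedIndividual elements, and the sample ontology and its example.org IRIs are made up.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
OWL = "http://www.w3.org/2002/07/owl#"

# Made-up RDF/XML with one class and two enumerated individuals.
SAMPLE = """<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:owl="http://www.w3.org/2002/07/owl#">
  <owl:Class rdf:about="http://example.org/demo#Color"/>
  <owl:NamedIndividual rdf:about="http://example.org/demo#Red">
    <rdf:type rdf:resource="http://example.org/demo#Color"/>
  </owl:NamedIndividual>
  <owl:NamedIndividual rdf:about="http://example.org/demo#Blue">
    <rdf:type rdf:resource="http://example.org/demo#Color"/>
  </owl:NamedIndividual>
</rdf:RDF>
"""


def named_individuals(rdf_xml: str):
    """Return (IRI, [type IRIs]) for each owl:NamedIndividual element.

    Limitation: individuals serialized in other ways (for example, as
    typed nodes without an owl:NamedIndividual element) are not found.
    """
    root = ET.fromstring(rdf_xml)
    found = []
    for element in root.iter(f"{{{OWL}}}NamedIndividual"):
        iri = element.get(f"{{{RDF}}}about")
        types = [t.get(f"{{{RDF}}}resource") for t in element.findall(f"{{{RDF}}}type")]
        if iri:
            found.append((iri, types))
    return found


individuals = named_individuals(SAMPLE)
```

Filtering out unwanted individuals is then a matter of dropping entries from the returned list before diagramming.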

Second, here are some more writing tips, that build on the post, Words and writing .... Most of these I learned in high school (a very long time ago), as editor of the school paper. (And, yes, I still use them today.)
  • My teacher taught us to vary the first letter of each paragraph, and start the paragraphs with interesting words (e.g., not "the", "this", "a", ...). Her point was that people got an impression of the article from glancing at the page, and the first words of the paragraphs made the most impression. If the words were boring, then the article was boring. I don't know if this is true, but it seems like a reasonable thing.
  • Another good practice is to make sure your paragraphs are relatively short, so as not to seem overwhelming. (I try to keep my paragraphs under 5-6 sentences.) Also, each paragraph should have a clear focus and stick to it. It is difficult to read when the main subject of a paragraph wanders.
  • Lastly, use a good opening sentence for each paragraph. It should establish the contents of the paragraph - setting it up for more details to come in the following sentences.
You can check out more writing tips at "Hot 100 News Writing Tips".

Andrea