Monday, November 3, 2014

Document-related ontologies

In my previous post, I defined the competency questions and initial focus for an ontology development effort. The goal of that effort is to answer the question, "What access and handling policies are in effect for a document?"

A relatively (and judging by the length of this post, I do mean "relatively"!) easy place to start is by creating the document-related ontology(ies). (Remember that I am explicitly walking through all the steps in my development process and not just throwing out an answer. At this time, I don't know what the complete answer is!)

Unless your background is content management, or you are a metadata expert, the first step is to learn the basic principles and concepts of the domain being modeled. This helps to define the ontologies and also establishes a base set of knowledge so that you can talk with your customer and their subject-matter experts (SMEs). Never assume that you are the first person to model a domain, or that you inherently know the concepts because they are "obvious". (Unless you invented the domain, you are not the first person to model it. Unless you work in the domain, you don't really know it!)

Beyond just learning about the domain, there are additional advantages to understanding the basic principles and concepts, and to looking at previous work. First, you don't want to waste the experts' time. That time is valuable and often limited. The more you waste an expert's time, the less they want to talk to you. Second, you need to understand the basics, since these are sometimes so obvious to the experts that they consider them "implicit". That is, they don't say anything about the basics and their assumptions, and eventually you get confused or lost (because you lack the necessary background, or because you make your own assumptions on the fly during the conversations). Third, it is valuable to know where mistakes might have been made or where models were created that seem "wrong" to you. It is also valuable to know where there are differences of opinion in a domain - and which side of each debate your experts land on. Understanding boundary cases, and maybe accounting for multiple solutions, may make the difference between your ontology succeeding or failing.

Background knowledge can come from many places, but I usually start with a search engine (Google, Bing, Yahoo, etc., according to your personal preference). I type in various phrases and then follow the links. Here are some of the phrases that I started with, for the "documents" space:
  • Dublin Core (since that was specifically mentioned in the competency questions)
  • Document metadata
  • Document management system
  • Document ontology (since there may be a complete ontology ready to adapt or directly reuse)
Clearly this is just a starting list, since each link leads to others. It is valuable to review any Wikipedia links that come up (as they usually provide a level-set). Pay particular attention to standards. Then dig a bit deeper, looking at academic and business articles, papers and whitepapers. You can do this with a search engine and by checking the digital library of your company or of a professional organization (such as IEEE or ACM).

Here is where my initial investigations took me:

You can also take a look at the metadata-ontology that I developed from Dublin Core and SKOS, and discussed in earlier posts.

As for the RDF and ontologies, I don't want to take them "as-is" and just put them together. I first want to quickly review them, as well as the ideas from other relevant references (such as OASIS's ODF). Then, we can begin to define a base ontology. It is important to always keep our immediate goals in focus (which are mostly related to document metadata), but also to have an idea of probable (or possible) extensions to the ontologies.

When creating my ontologies, I usually (always?) end up taking piece-parts and reusing concepts from multiple sources. The parts can be imported and rationalized via an integrating ontology, or cut and pasted from the different sources into a new ontology. There are advantages and disadvantages to each approach.

When importing the original ontologies and integrating them (especially when using a tool like Protege), you end up with a large number of classes and properties, with (in the best case) many exact duplicates or (in the worst case) many subtle differences in semantics. This can be difficult to manage and sort through, and it takes time to get a good understanding of the semantics of each individual model/ontology. Another problem with this approach is that the ontologies sometimes evolve. If this happens, URLs may change and your imports could break. Or, you may end up referencing a concept that was renamed or no longer exists. Ideally, when an ontology is published, a link is maintained to each individual version, but this does not always happen. I usually take the latest version of an ontology or model and save it to a local directory, maintaining the link to the source and also noting the version (for provenance).
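The local-copy habit described above can be sketched in a few lines of Python. This is only an illustration, not the tooling I actually use: the function name, the JSON "sidecar" file layout, and the example.org URL in the usage note are all hypothetical, and the checksum is an extra that I am assuming is useful for spotting later changes at the source.

```python
import hashlib
import json
from datetime import date
from pathlib import Path


def save_with_provenance(content: bytes, source_url: str, version: str,
                         directory: Path, filename: str,
                         retrieved: date) -> dict:
    """Save an ontology file locally, plus a JSON sidecar recording its origin.

    Hypothetical helper: the sidecar name and its fields are assumptions
    made for this sketch, not part of any standard.
    """
    directory.mkdir(parents=True, exist_ok=True)
    (directory / filename).write_bytes(content)

    record = {
        "source": source_url,                  # link back to the original
        "version": version,                    # version noted for provenance
        "retrieved": retrieved.isoformat(),    # when the copy was taken
        "sha256": hashlib.sha256(content).hexdigest(),  # detects later drift
    }
    sidecar = directory / (filename + ".provenance.json")
    sidecar.write_text(json.dumps(record, indent=2), encoding="utf-8")
    return record
```

For example, after downloading a file from a (made-up) address such as http://example.org/ontologies/documents.owl, you would pass its bytes, that URL, and the version string to `save_with_provenance`. If the published ontology later changes or moves, the sidecar still tells you exactly what you imported and from where.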

Cutting and pasting the various piece parts of different ontologies makes it easier to initially create and control your ontology. The downside is that you sometimes lose the origins and provenance of the piece parts, and/or lose the ability to expand into new areas of the original ontologies. The latter may happen because those ontologies are not "in front" of you ("out of sight, out of mind") or because you have deviated too far from the original semantics and structure.
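One way to soften the provenance problem when cutting and pasting is to keep a small record of where each borrowed term came from, and to write that record into the new ontology as annotations. The sketch below is one possible approach under stated assumptions: the `http://example.org/doc#` namespace, the `ReusedTerm` class and the `to_turtle` function are hypothetical names invented for illustration, and using `prov:wasDerivedFrom` (from the W3C PROV-O vocabulary) is my choice for the example rather than an established convention for this purpose. Notes are assumed to be single-line strings.

```python
from dataclasses import dataclass

LOCAL_NS = "http://example.org/doc#"  # hypothetical namespace for the new ontology


@dataclass(frozen=True)
class ReusedTerm:
    local_name: str   # name of the term in the new ontology
    source_iri: str   # IRI of the original term it was copied from
    note: str = ""    # optional single-line remark on how it deviates


def _literal(text: str) -> str:
    """Quote a single-line string as a Turtle literal."""
    return '"' + text.replace("\\", "\\\\").replace('"', '\\"') + '"'


def to_turtle(terms) -> str:
    """Emit Turtle annotations linking each local term to its source."""
    lines = [
        "@prefix prov: <http://www.w3.org/ns/prov#> .",
        "@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .",
        "",
    ]
    for term in terms:
        statement = (f"<{LOCAL_NS}{term.local_name}> "
                     f"prov:wasDerivedFrom <{term.source_iri}>")
        if term.note:
            statement += f" ;\n    rdfs:comment {_literal(term.note)}"
        lines.append(statement + " .")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(to_turtle([
        ReusedTerm("title", "http://purl.org/dc/terms/title"),
        ReusedTerm("Topic", "http://www.w3.org/2004/02/skos/core#Concept",
                   note="Narrowed to document subjects only."),
    ]))
```

Keeping the source IRI next to each pasted term means the original ontologies stay at least partly "in front" of you, and the note field gives you a place to say where you have deliberately deviated from the original semantics.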

In my next posts, I will continue to discuss a design for the document-related ontologies (focusing on the immediate needs to reflect the Dublin Core metadata and the existence/location of the documents). In the meantime, let me know if I missed any valuable references, or if you have other ideas for the ontologies.

Andrea
