Tuesday, September 19, 2017

What can be learned from the OntoGraph project?

OntoGraph was introduced in my last post, OWL Ontology Graphing Program Available as Open Source. And there are a lot of interesting things in the code! Over the next few weeks, I want to take time to relay what I learned, as well as to provide insights into OWL ontologies, SPARQL queries, Bootstrap, Backbone and RESTful interfaces, the Model-View-Controller and other patterns, Spring Boot, Lombok, programming Stardog, testing, Gradle builds, and much more. Some of this will be basic stuff (but hopefully useful to some of my readers) and some will be more advanced. Feel free to pick and choose, or let me know what you want to hear about!

But, first, I want to talk about our development environment ...

The precursor to OntoGraph was originally created in about 2 days to provide some basic diagrams of a customer's ontology. Hand-drawing all the classes, properties, axioms, etc. of the ontologies was too painful and error-prone. Using a tool like OntoViz with Protege was just not flexible enough, and the images were not what the customer wanted to see. The ProtegeVOWL plug-in was also not sufficient since VOWL does not diagram all the necessary constructs (I will talk more about this in a future post). In addition, the customer did not want to be tied to using Protege since they weren't ontologists. They just wanted a diagram and to be able to play around with the layout.

Well, the 2-day "quick and dirty" version worked and the customer had their diagrams. That could have been the end of the story. But, we hired an intern who needed to learn about ontologies, the Stardog triple store, SPARQL queries and lots of other things. So, we decided to use the graphing program as a learning experience. We took the initial work and decided first to just address some bugs. Then, we decided to add the ability to customize the output, which required a front-end. Then, we added support for different kinds of visualization (Graffoo, VOWL, UML). And, the program grew. We changed directions, rewrote whole sections of the program, updated our approach to the front-end at least three times, updated our approach to testing at least twice, and upgraded our infrastructure at least twice (updating the Gradle, Stardog, JavaScript libraries, etc.). We put months of work into the program, definitely taking an agile approach and learning to "fail fast".

There are lessons here ... Good software takes time. There is always more to learn. Don't be afraid to take what you learn and rewrite what is problematic (as long as you have time and there are no other programming fires burning). There is always something that you can do better. And, always remember that Stack Overflow is your friend!

Well, ok then ... back to agile. For our agile environment, we used Atlassian's products - JIRA for issue tracking and managing our process (Kanban, actually), integrated with a Bitbucket Git repository for version control, and Bamboo as our continuous integration environment. Since we are a small company, this was an easy and cheap solution ($10 for each product). In addition, when we decided to get serious about releasing the code as open source, we also decided to incorporate SonarQube into our continuous integration environment.

As someone who has always spent either too much or too little time on code reviews, I found SonarQube to be great! Per Wikipedia, it provides "continuous inspection of code quality to perform automatic reviews with static analysis of code to detect bugs, code smells and security vulnerabilities" (http://en.wikipedia.org/wiki/SonarQube). And, it does this for 20+ programming languages (although we only needed Java, JavaScript and CSS). This took a lot of the pain out of code reviews. I focused on whether the method and property names were understandable, whether the code seemed reasonable and was somewhat efficient, and things like that. SonarQube took care of finding problems related to bad practice, inefficiency, and errors (such as not initializing a variable). In addition, SonarQube would complain if you nested if/while/for/switch/try statements too deeply, or implemented methods with too many parameters or that were too complex. In reality, SonarQube was tougher on my code than any team review that I had experienced in the past.
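To make the nesting complaint concrete, here is a hypothetical illustration (not code from OntoGraph) of the kind of refactoring that SonarQube's nesting and complexity rules push you toward - replacing nested if statements with flat guard clauses:

```java
// Hypothetical example of taming nesting depth. The validation logic and
// names are invented for illustration.
public class PrefixValidator {

    // Before: nested ifs that SonarQube would flag as too deeply nested.
    public static boolean isValidNested(String prefix) {
        if (prefix != null) {
            if (!prefix.isEmpty()) {
                if (prefix.length() <= 10) {
                    return prefix.matches("[a-z][a-z0-9]*");
                }
            }
        }
        return false;
    }

    // After: guard clauses keep the method flat and easier to review.
    public static boolean isValid(String prefix) {
        if (prefix == null || prefix.isEmpty()) {
            return false;
        }
        if (prefix.length() > 10) {
            return false;
        }
        return prefix.matches("[a-z][a-z0-9]*");
    }
}
```

Both versions behave identically; the second just reads (and reviews) better.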

Now, you can make things easier on yourself and change the defaults in the SonarQube rules. For example, you can allow a complexity of 30 instead of 15, or allow nesting of if/while/... past 3 levels. But, we didn't do that for OntoGraph. We figured that we would keep the defaults and fix most of the problems (or, we would eventually fix them). There are some "issues" that are just false positives, and others that we have not yet addressed. If you want to find them in the OntoGraph code, just look for "//NOSONAR" and then the explanation that follows. The "//NOSONAR" comment tells SonarQube to ignore the issue for now - either it is a false positive or we acknowledge that there is a problem and are willing to accept the issue for now. I think that this is a valuable approach. Most of the existing issues in OntoGraph are complexity, and we will fix those!
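As a hypothetical sketch of the convention (the actual comments in OntoGraph differ), a suppressed SonarQube issue looks something like this - SonarQube normally complains when a caught exception is neither logged nor rethrown, and the "//NOSONAR" comment with a brief explanation accepts that issue:

```java
// Invented example of the "//NOSONAR" convention; not OntoGraph code.
public class SafeParser {

    // Returns the parsed integer, or the supplied default if the input
    // is not numeric.
    public static int parseOrDefault(String text, int dflt) {
        try {
            return Integer.parseInt(text);
        } catch (NumberFormatException e) { //NOSONAR - exception intentionally ignored; the default is returned
            return dflt;
        }
    }
}
```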

Another important aspect is test coverage. When we decided to release OntoGraph as open source, we set a testing threshold of at least 80% on the back-end processing classes (so this would be GraphController.java, GraphDAO.java and all the classes in the graphmloutputs folder). All of these classes have coverage between 93.2% and 98.3%, except one. TitleAndPrefixCreation.java has a test coverage of 77.8%, with 2 (yes, 2) uncovered lines. Those lines throw an IllegalAccessError if something tries to instantiate the class (which should not be done, since the class contains only 1 static method). Oh well, we decided that this was definitely good enough!
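The pattern at issue is the standard "non-instantiable utility class" idiom. A minimal sketch (with invented names, not OntoGraph's actual code) looks like this - the constructor body is exactly the kind of line that coverage tools report as uncovered:

```java
// Sketch of a static utility class whose private constructor blocks
// instantiation (even via reflection). Names here are invented.
public final class TitleUtils {

    private TitleUtils() {
        // This line only executes if something tries to instantiate the
        // class - so it (reasonably) never shows up in test coverage.
        throw new IllegalAccessError("TitleUtils is a static utility class");
    }

    // The class's single static method.
    public static String createTitle(String graphType, String ontologyName) {
        return graphType + " diagram of " + ontologyName;
    }
}
```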

You can see SonarQube in action by downloading OntoGraph and following SonarQube's instructions for Getting Started in Two Minutes. After starting and logging into SonarQube according to the instructions, go to where you downloaded OntoGraph. Type "./gradlew sonar" (or "gradlew.bat sonar" on Windows), making sure that you have installed Gradle :-). After that completes, you can see all the rules/issues, statistics and more.

P.S. Sorry for the riff on SonarQube, but I wanted to hit on some cool details. And, I will talk about how Gradle supports SonarQube in a future post. This post just got way too long!


Wednesday, September 13, 2017

OWL Ontology Graphing Program Available as Open Source

It has been forever since I last blogged on this site (more than a year, for which I feel terrible). I have been wrapped up in work for a customer whose details are proprietary, and I was also slowly working to create (what I hope will be valuable) ontology graphing software. I wish that the graphing software had been available sooner, but better late than never ... The graphing software is called OntoGraph, and it is finally at a point where it is acceptable to publish and I can freely discuss it on the blog! So, here we go ...

You can check out the work at Nine Points Solutions' GitHub repository.

OntoGraph is a Spring Boot application for graphing OWL ontologies (yes, the title says this). It lets you go from RDF/XML, Turtle and several other OWL syntaxes to a custom, Graffoo, VOWL or UML-like diagram. For example, you can go from something like this (this excerpt comes from the Friend of a Friend, FOAF.rdf, ontology - you can see the complete FOAF ontology at http://xmlns.com/foaf/spec/index.rdf) ...

To ...

The above image is a VOWL rendering of FOAF.

OntoGraph is designed with a Bootstrap- and Backbone-based GUI (written in JavaScript), interfacing with a RESTful API. The main program is written in Java. It operates by creating various GraphML outputs of a user-provided OWL ontology file. (It also accepts a zip file containing a set of ontology files.) The program stores the ontologies in the Stardog triple store, then runs a series of queries to return the necessary information on the classes, properties, individuals, ... to be diagrammed. Layout of the resulting GraphML is handled by another program. (We recommend yEd.)
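To give a flavor of those queries (this is a simplified sketch, not one of OntoGraph's actual queries), retrieving the declared classes and their labels from the stored ontology might look like:

```sparql
# Simplified sketch - not an actual OntoGraph query.
PREFIX owl:  <http://www.w3.org/2002/07/owl#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT ?class ?label
WHERE {
  ?class a owl:Class .
  OPTIONAL { ?class rdfs:label ?label }
  FILTER ( !isBlank(?class) )
}
ORDER BY ?class
```

Each diagram type then maps results like these onto its own GraphML shapes and styles.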

Four visualizations of ontology data can be generated:
  • Custom format (defined to fit existing business or personal preferences)
  • Graffoo
  • UML-like
  • VOWL
And, information can be segmented to display:
  • Class-related information (subclassing, equivalent and disjoint classes, class restrictions, ...)
  • Individual instances, their types, and their datatype and object property information
  • Property information (datatype and object properties, functional/symmetric/... properties, domain and range definitions, ...)
  • Both class and property information
Complete information about OntoGraph, how to run it, and issues and upcoming features is available at the GitHub repository. Also, there is a pre-publication version of a paper there, which explains OntoGraph and why it was created. (The paper will be available in an upcoming issue of the journal Applied Ontology, from IOS Press.)

So, now that OntoGraph is finally published, I can start to blog about its components, design and design decisions, testing, and lots of other details. I just needed something concrete!

I hope that you find the program useful!


Monday, January 25, 2016

Ontologies for Reuse Versus Integration

There is ongoing email discussion in preparation for this year's Ontology Summit on "semantic integration". I thought that I would share one of my recent posts to that discussion here, on my blog. The issue is reuse versus integration ...

For me, designing for general reuse is a valid and valuable goal (if you have the time, which is not always true). (It was also the subject of the Summit 2 years ago and many of my posts from that time - March-May 2014!) But reusing an ontology or design pattern in multiple places is not semantic integration. Reuse and integration are different beasts, although they are complementary.

I have designed ontologies for both uses (reuse and integration), but my approach to the two is different. Designing for reuse is usually focused on a small domain that is well understood. There are general problem areas (such as creating ontologies/design patterns for events, or to support Allen's time interval algebra) that are generally applicable. In these areas, general design and reuse makes sense.

Over the years, however, I have been much more focused on designing for integration (especially in the commercial space). In my experience, companies are always trying to combine different systems together - whether these systems are legacy vs new, systems that come into the mix due to acquisition, internal (company-centric) vs external (customer-driven), dictated by the problem space (combining systems from different vendors or different parts of an organization to solve a business problem), ...

It is ok to try to be forward-thinking in designing these integration ontologies ... anticipating areas of integration. But, I have been wrong in my guesses (of what was needed in the "future" ontology) probably more than I have been right - unless it was indeed in general problem domains.

So, my integration "rules of thumb" are:
  • Get the SMEs in a particular domain to define the problem space and their solution (don't ever ask the SMEs about integrating their domains)
  • Never favor one domain over another in influencing the ontology (you are sure not to be future-proof)
  • Focus on the biggest problem areas first, and find the commonalities/general concepts (superclasses)
  • Place the domain details "under" these superclasses
  • Never try to change the vocabulary of a domain, just map to/from the domains to the "integration" ontology
  • Never map everything in a domain, just what needs to be integrated
  • Look for smaller areas of "general patterns" that can be broadly reused
  • Have new work start from the integrating ontology instead of creating a totally new model
  • Update the integrating ontology based on mapping problems and new work (never claim that the ontology is immutable)
  • Utilize OWL's equivalentClass/disjointFrom/intersectionOf/unionOf/... (for classes), sameAs/differentFrom (for individuals) and class and property restrictions to tie concepts together in the "mapped" space
  • Focus on concept diagrams and descriptions, documented mapping details, ... and not on the fact that you are using an ontology
  • Clearly document ontology/concept, relationship, ... evolution
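To make the mapping rules above concrete, here is a minimal Turtle sketch - all of the names are invented for illustration - showing two domains placed "under" an integration superclass, with an OWL equivalence tying overlapping concepts together:

```turtle
# Hypothetical sketch of the mapping rules above (all names invented).
@prefix owl:  <http://www.w3.org/2002/07/owl#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix int:  <http://example.com/integration#> .
@prefix crm:  <http://example.com/crm-domain#> .
@prefix bill: <http://example.com/billing-domain#> .

# Common superclass in the integration ontology ...
int:Customer a owl:Class .

# ... with the domain details placed "under" it, vocabulary unchanged.
crm:Account        a owl:Class ; rdfs:subClassOf int:Customer .
bill:BillableParty a owl:Class ; rdfs:subClassOf int:Customer .

# Where two domains mean the same thing, say so explicitly.
crm:PrimaryContact owl:equivalentClass bill:AccountHolder .
```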
Let me know if this resonates with you or if you have different "rules of thumb".


Sunday, January 3, 2016

2016 and continuing posts on ontologies

Well, 2015 seems to have gotten away from me. Over the last year, I have been working to design and implement several ontologies for policy-based management. The work is based on Complexible's Stardog graph database, with services accessed through a RESTful API, and with a front-end, single-page web application created with Bootstrap and Backbone. It has been a blast working with and learning all these technologies, and my new year's resolution is to get back into writing my blog and share some of my learnings.

Another thing that I am doing is related to the International Association for Ontology and Its Applications (IAOA). More specifically, I am a part of the Semantic Web Applied Ontologies Special Interest Group (SWAO SIG). The SIG is continuing the work of the 2014 Ontology Summit and facilitating discussions of ontologies and their development and application. I will be contributing to those discussions in 2016, and started with a short post on the various definitions of the term "ontology". Check out the SWAO link above for the discussion!

That's it for now. Happy 2016!


Friday, November 28, 2014

More links to document-related ontologies

Over the course of the last few weeks, a few people have emailed me additional references for document ontologies. Both are valuable links that I want to use in my solution, so I need to expand the references from my previous post to add these:
  • SALT (Semantically Annotated LaTeX for Scientific Publications) Document Ontology
    • SALT is described in this paper, but unfortunately none of the associated/referenced ontologies are still available
  • FRBRoo
    • A harmonization of the FRBR model (an entity-relationship model from the International Federation of Library Associations and Institutions, published in 1998) and the CIDOC CRM model (ISO Standard 21127, developed by ICOM-CIDOC, the International Council for Museums – International Committee on Documentation)
    • Per the documentation for Version 2 of the model, FRBRoo is "a formal ontology that captures and represents the underlying semantics of bibliographic information and therefore facilitates the integration, mediation, and interchange of bibliographic and museum information"
    • Also, there is an "owlified" version available at https://github.com/erlangen-crm
Given these new insights, I have a bit more work to do on my solution.


Monday, November 3, 2014

Document-related ontologies

In my previous post, I defined the competency questions and initial focus for an ontology development effort. The goal of that effort is to answer the question, "What access and handling policies are in effect for a document?"

A relatively (and judging by the length of this post, I do mean "relatively"!) easy place to start is by creating the document-related ontology(ies). (Remember that I am explicitly walking through all the steps in my development process and not just throwing out an answer. At this time, I don't know what the complete answer is!)

Unless your background is content management, or you are a metadata expert, the first step is to learn the basic principles and concepts of the domain being modeled. This helps to define the ontologies and also establishes a base set of knowledge so that you can talk with your customer and their subject-matter experts (SMEs). Never assume that you are the first person to model a domain, or that you inherently know the concepts because they are "obvious". (Unless you invented the domain, you are not the first person to model it. Unless you work in the domain, you don't really know it!)

Beyond just learning about the domain, there are additional advantages to understanding the basic principles and concepts, and looking at previous work ... First, you don't want to waste the experts' time. That is valuable and often limited. The more that you waste an expert's time, the less that they want to talk to you. Second, you need to understand the basics since these are sometimes so obvious to the experts that they consider them "implicit". I.e., the experts don't say anything about the basics and their assumptions, and eventually you get confused or lost (because you don't have the necessary background, or make your own assumptions in real-time, in the conversations). Third, it is valuable to know where mistakes might have been made or where models were created that seem "wrong" to you. Also, it is valuable to know where there are differences of opinion in a domain - and to know on which side of a debate your experts land. Understanding boundary cases, and maybe accounting for multiple solutions, may make the difference between your ontology succeeding or failing.

Background knowledge can come from many places. But, I usually start with Google, Bing, Yahoo, etc. (depending on your personal preference). I type in various phrases and then follow the links. Here are some of the phrases that I started with, for the "documents" space:
  • Dublin Core (since that was specifically mentioned in the competency questions)
  • Document metadata
  • Document management system
  • Document ontology (since there may be a complete ontology ready to adapt or directly reuse)
Clearly this is just a starting list, since each link leads to others. It is valuable to review any Wikipedia links that come up (as they usually provide a level-set). Especially, pay attention to standards. Then dig a bit deeper, looking at academic and business articles, papers and whitepapers. You can do this with a search engine and by checking your company's, or an organization's (such as IEEE and ACM), digital library.

Here is where my initial investigations took me: You can also take a look at the metadata ontology that I developed from Dublin Core and SKOS, and discussed in earlier posts.

As for the RDF and ontologies, I don't want to take them "as-is" and just put them together. I first want to quickly review them, as well as the ideas from other relevant references (such as OASIS's ODF). Then, we can begin to define a base ontology. It is important to always keep our immediate goals in focus (which are mostly related to document metadata), but also to have an idea of probable (or possible) extensions to the ontologies.

When creating my ontologies, I usually (always?) end up taking piece-parts and reusing concepts from multiple sources. The parts can be imported and rationalized via an integrating ontology, or are cut and pasted from the different sources into a new ontology. There are advantages and disadvantages to each approach.

When importing the original ontologies and integrating them (especially when using a tool like Protege), you end up with a large number of classes and properties, with (hopefully) many duplicates or (worst case) many subtle differences in semantics. This can be difficult to manage and sort through, and it takes time to get a good understanding of the individual model/ontology semantics. Another problem with this approach is that the ontologies sometimes evolve. If this happens, URLs may change and your imports could break. Or, you may end up referencing a concept that was renamed or no longer exists. Ideally, when an ontology is published, a link is maintained to the individual versions, but this does not always happen. I usually take the latest version of an ontology or model, and save it to a local directory, maintaining the link to the source and also noting the version (for provenance).
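One way to capture that provenance (a sketch with invented IRIs, not a prescription) is in the ontology header itself - import the locally saved copy, and record the version and original source as annotations:

```turtle
# Hypothetical ontology header (all IRIs invented) recording the
# provenance of a locally saved import.
@prefix owl:  <http://www.w3.org/2002/07/owl#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

<http://example.com/document-integration>
    a owl:Ontology ;
    owl:versionIRI <http://example.com/document-integration/1.0> ;
    # Import the copy saved to a local directory, not the live URL ...
    owl:imports <http://example.com/local-copies/dcterms-snapshot> ;
    # ... and note where the copy came from, and which version it was.
    rdfs:comment "dcterms snapshot saved locally from http://purl.org/dc/terms/; the retrieval date and version are noted in the local directory." .
```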

Cutting and pasting the various piece parts of different ontologies makes it easier to initially create and control your ontology. The downside is that you sometimes lose the origins and provenance of the piece parts, and/or lose the ability to expand into new areas of the original ontologies. The latter may happen because those ontologies are not "in front" of you ("out of sight, out of mind") or because you have deviated too far from the original semantics and structure.

In my next posts, I will continue to discuss a design for the document-related ontologies (focusing on the immediate needs to reflect the Dublin Core metadata and the existence/location of the documents). In the meantime, let me know if I missed any valuable references, or if you have other ideas for the ontologies.


Sunday, October 19, 2014

Breaking Down the "Documents and Policies" Project - Competency Questions

Our previous post defined a project for which a set of ontologies is needed ... "What access and handling policies are in effect for a document?" So, let's just jump into it!

The first step is always to understand the full scope of work and yet to be able to focus your development activities. Define what is needed both initially (to establish your work and ontologies) and ultimately (at the end of the project). Determine how to develop the ontologies, in increments, to reach the "ultimate" solution. Each increment should improve or expand your design, taking care to never go too far in one step (one development cycle). This is really an agile approach and translates to developing, testing, iterating until things are correct, and then expanding. Assume that your initial solutions will need to be improved and reworked as your development activities progress. Don't be afraid to find and correct design errors. But ... Your development should always be driven by detailed use cases and (corresponding) competency questions.

Competency questions were discussed in an earlier post, "General, Reusable Metadata Ontology - V0.2". (They are the questions that your ontology should be able to answer.) Let's assume that you and your customer define the following top-level questions:
  • What documents are in my repositories?
  • What documents are protected or affected by policies?
  • What documents are not protected or affected by policies? (I.e., what are the holes?)
  • What policies are defined?
  • What are the types of those policies (e.g., access or handling/digital rights)?
  • What are the details of a specific policy?
  • Who was the author of a specific policy?
  • List all documents that are protected by multiple access control policies. And, list the policies by document.
  • List all documents that are affected by multiple handling/digital rights policies. And, list the policies by document.
These questions should lead you to ask other questions, trying to determine the boundaries of the complete problem. Remember that it is unlikely that the customer's needs will be addressed in a single set of development activities. (And, work will hopefully expand with your successes!) Often, a customer has deeper (or maybe different) questions that they have not yet begun to define. Asking questions and working with your customer can begin to tease this apart. Even if the customer does not want to go further at this time, it is valuable to understand where and how the ontologies may need to be expanded. Always take care to leave room to expand your ontologies to address new use cases and semantics.
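Competency questions like these eventually become queries over the ontology. As a hedged illustration (every class and property name below is an invented placeholder, not part of any design yet), the "holes" question - which documents are not protected by any policy - might translate to:

```sparql
# Hypothetical sketch: "What documents are NOT protected or affected by
# policies?" All names are invented placeholders.
PREFIX ex: <http://example.com/docs-and-policies#>

SELECT ?document
WHERE {
  ?document a ex:Document .
  FILTER NOT EXISTS { ?policy ex:appliesTo ?document }
}
```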

This brings us back to "General Systems Thinking". It is important to understand a system, its parts and its boundaries.

Here are some follow-on questions (and their answers) that the competency questions could generate:
  • Q: Given that you have document repositories, how are the documents identified and tagged?
    • A: A subset of the Dublin Core information is collected for each document: Author, modified-by, title, creation date, date last modified, keywords, proprietary/non-proprietary flag, and description.
  • Q: How are the documents related to policies?
    • A: Policies apply to documents based on a combination of their metadata.
  • Q: Will we ever care about parts of documents, or do we only care about the documents as a whole?
    • A: We may ultimately want to apply policies to parts of documents, or subset a document based on its contents and provide access to its parts. But, this is a future enhancement.
  • Q: Do policies change over time (for example, becoming obsolete)?
    • A: Yes, we will have to worry about policy evolution and track that.
  • Q: What policy repositories do you have?
    • A: Policies are defined in code and in some specific content management systems. The goal is to collect the details related to all the documents and all the policies in order to guarantee consistency and remove/reduce conflicts.
  • Q: Given the last 2 competency questions, and your goal of removing/reducing conflicts, would you ultimately like the system to find inconsistencies and conflicts? How about making recommendations to correct these?
    • A: Yes! (We will need to dig into this further at a later time in order to define conflicts and remediation schemes.)
Well, we now know more about the ontologies that we will be creating. Initially, we are concerned with document identification/location/metadata and related access and digital rights policies. We can then move on to the provenance and evolution of documents and policies, and understanding conflicts and their remediation.

So, the next step is to flesh out the details for documents and policies. We will begin to do that in the next post.