What is it?
News on the Web (now.ontotext.com) is a free public service, a live showcase of some of the basic capabilities of Semantic Technology. It relies on knowledge about the world to create structured data from text and to expand the knowledge about the world by feeding back new entities and the relations between them.
The methodology can be applied to any domain. This particular demo, NOW, showcases the opportunities that a dynamic semantic publishing platform opens up for media and publishing companies. In addition, we have a language-independent methodology for named entity recognition and disambiguation (NERD), which allows us to expand not only into multiple domains, but also into multiple languages.
How does it work?
NOW has an RSS crawler which feeds documents from multiple sources into a custom processing component, which annotates the content and then stores it in GraphDB via the Concept API. Since all the components provide RESTful APIs, it is relatively easy to embed the dynamic semantic publishing platform into an architecture based on distributed messaging systems, such as Apache Kafka.
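The flow described above (crawl, annotate, store) can be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not the actual NOW implementation: the sample feed, the gazetteer, the `annotate` function and the in-memory `store` dictionary are all hypothetical stand-ins for the RSS sources, the processing component, and GraphDB behind the Concept API.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample feed; a real crawler would fetch feeds over HTTP.
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <item>
      <title>Floods hit London</title>
      <link>http://example.org/news/1</link>
      <description>Heavy rain caused floods in London on Monday.</description>
    </item>
    <item>
      <title>Paris hosts summit</title>
      <link>http://example.org/news/2</link>
      <description>Leaders met in Paris to discuss trade.</description>
    </item>
  </channel>
</rss>"""

# Toy gazetteer mapping surface forms to concept URIs (assumed for illustration).
GAZETTEER = {
    "London": "http://dbpedia.org/resource/London",
    "Paris": "http://dbpedia.org/resource/Paris",
}


def crawl(rss_text):
    """Yield one document dict per <item> element in an RSS 2.0 feed."""
    root = ET.fromstring(rss_text)
    for item in root.iter("item"):
        yield {
            "url": item.findtext("link", default=""),
            "title": item.findtext("title", default=""),
            "text": item.findtext("description", default=""),
        }


def annotate(doc):
    """Attach the URI of every gazetteer entry found in the title or text."""
    content = f"{doc['title']} {doc['text']}"
    concepts = sorted(uri for label, uri in GAZETTEER.items() if label in content)
    return {**doc, "concepts": concepts}


def run_pipeline(rss_text, store):
    """Crawl, annotate and store each document, keyed by its URL."""
    for doc in crawl(rss_text):
        annotated = annotate(doc)
        store[annotated["url"]] = annotated
    return store


store = run_pipeline(SAMPLE_RSS, {})
for url, doc in store.items():
    print(url, doc["concepts"])
```

In a messaging-based deployment, each of the three functions would plausibly become a separate consumer or producer; the sketch keeps them in one process only for brevity.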
Several open linked datasets, including DBpedia 2015, Wikidata, and GeoNames, have been combined in NOW to create a high-coverage, general-purpose dataset containing over 4 million People, Locations, Organizations, Animals, Plants and other Things. The dataset is used to perform automated concept and relation extraction, which generates the semantic fingerprint of an article. This fingerprint is then used to build links between content, suggest similar content, and provide faceted and hybrid (concept + FTS) search. On top of this, the platform can also recognize things in the articles that are not yet present in the dataset. This allows the dataset to be extended with automatically extracted data, for example when new companies are found or new people become popular in the news.
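To make the idea of a semantic fingerprint more concrete, here is a small sketch of how one might be used to suggest similar content. It assumes a fingerprint is simply a count of concept mentions per article and that similarity is measured with cosine similarity; both are assumptions made for illustration, since the way NOW actually scores similarity is not described here. The article IDs and `ex:` concept names are invented.

```python
from collections import Counter
from math import sqrt

# Hypothetical extraction output: article ID -> concepts mentioned in it.
ARTICLES = {
    "a1": ["ex:London", "ex:London", "ex:Flood"],
    "a2": ["ex:London", "ex:Thames"],
    "a3": ["ex:Paris", "ex:Trade"],
}


def fingerprint(concept_mentions):
    """Build a fingerprint: concept identifier -> number of mentions."""
    return Counter(concept_mentions)


def cosine_similarity(a, b):
    """Cosine similarity between two fingerprints (0.0 if either is empty)."""
    dot = sum(a[c] * b[c] for c in set(a) & set(b))
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def suggest_similar(target_id, fingerprints, top_n=3):
    """Return up to top_n other article IDs sharing concepts, best match first."""
    target = fingerprints[target_id]
    scored = [
        (cosine_similarity(target, fp), other_id)
        for other_id, fp in fingerprints.items()
        if other_id != target_id
    ]
    # Highest score first; ties broken by article ID for a stable order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [other_id for score, other_id in scored[:top_n] if score > 0]


fingerprints = {aid: fingerprint(c) for aid, c in ARTICLES.items()}
print(suggest_similar("a1", fingerprints))
```

Here `a2` is suggested for `a1` because both mention `ex:London`, while `a3` shares no concepts and is left out.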
The more data silos are linked, the better use applications can make of the data. For instance, in a large enterprise with a news agency and a scientific report unit, the latter would benefit from easy access to, and search over, the news agency's content, especially when that content is enriched with knowledge about the domain. Users can then easily navigate between related pieces of information, drill down into particular topics, or aggregate content on a particular topic (provided, naturally, that both units use a shared vocabulary about the world).
The entire technology is based on RDF, SPARQL & Linked Data standards.
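As a rough illustration of what this means in practice, an annotation such as "this article mentions London" can be written as an RDF triple and later retrieved with a SPARQL query. The sketch below only builds the N-Triples and SPARQL text; it does not talk to a triple store. The `mentions` predicate and the example.org URIs are hypothetical and are not NOW's actual schema.

```python
# Hypothetical predicate linking an article to a concept it mentions.
MENTIONS = "http://example.org/vocab/mentions"


def to_ntriples(triples):
    """Serialize (subject, predicate, object) URI triples as N-Triples lines."""
    return "\n".join(f"<{s}> <{p}> <{o}> ." for s, p, o in triples)


def mentions_query(concept_uri):
    """Build a SPARQL query that selects the articles mentioning a concept."""
    return (
        "SELECT ?article WHERE {\n"
        f"  ?article <{MENTIONS}> <{concept_uri}> .\n"
        "}"
    )


triples = [
    ("http://example.org/news/1", MENTIONS, "http://dbpedia.org/resource/London"),
]
print(to_ntriples(triples))
print(mentions_query("http://dbpedia.org/resource/London"))
```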
Why is it important?
Although the platform does not currently rely on enterprise datasets, it is a good example of how the technology can be leveraged to create tools, both journalist-facing and other customer-facing solutions, using commercial or third-party information. It can also power B2B applications or internal analytics engines.
Showcases such as NOW are vital for an enterprise because they enable it to communicate the capabilities of the underlying technology to its clients, who are quite often members of the general public without any background in or understanding of semantic technology.
In terms of technological maturity, there are user interfaces which abstract the end users from the underlying technology, but there is more work to be done to enable non-experts to adapt the platform to their particular domain and use case.