Data Quality Tutorial

Monday, September 12, 2016 - 09:00 to 17:00
Campus Augustusplatz

Hosted by: Dimitris Kontokostas and Helmut Nagy

The Data Quality Tutorial aims to provide an overview of how data quality can be handled in practice, especially in the case of RDF and Linked Data. The tutorial is split into three thematic groups: a) existing & emerging technologies, b) data quality from the industry perspective and c) (applied) research approaches for tackling quality.

Presentation can be found here.

09:00 - 09:30   Data quality dimensions & metrics by Amrapali Zaveri, Stanford University The development and standardization of semantic web technologies has resulted in an unprecedented volume of data being published on the Web as Linked Data (LD). However, we observe widely varying data quality, ranging from extensively curated datasets to crowdsourced and extracted data of relatively low quality. In this talk, I will present the results of a survey conducted to gather all the approaches for assessing the quality of LD. The survey unified and formalized commonly used terminologies across 30 core approaches related to data quality and provided a comprehensive list of 18 quality dimensions and 69 metrics. Additionally, a set of 12 tools was qualitatively analyzed using a set of attributes. The aim of this talk is to give researchers and data curators a comprehensive understanding of existing work, thereby encouraging further experimentation and development of new approaches focused on data quality, specifically for LD.
09:30 - 10:30   OWL-based validation by Gavin Mendel-Gleason and Bojan Božić, Trinity College Dublin OWL is a rich ontology language for domain modelling. We use OWL ontology descriptions to constrain, describe and enter linked data in a consistent manner.
10:30 - 11:00   Coffee Break
11:00 - 12:00   SHACL-based validation by Dimitris Kontokostas, University of Leipzig SHACL is the upcoming W3C standard for defining constraint rules on RDF graphs. In this presentation we will provide an overview of SHACL and best practices for defining shapes for your data.
12:00 - 12:30   Crowdsourcing + mixed approaches for validation by Amrapali Zaveri, Stanford University With the vast amounts of Linked Data on the Web, the main challenge facing consumers is poor data quality. Current approaches for assessment are either automated or semi-automated. However, detecting certain quality problems still requires human intervention, which usually comes at a higher price, either in monetary rewards or in the form of effort to recruit participants in a volunteer setting. Crowdsourcing, on the other hand, employs workers from microtask crowdsourcing platforms such as Amazon Mechanical Turk to perform tasks for a minimal monetary reward. Crowdsourcing thus offers a formidable and readily available workforce at relatively low fees. In this talk, I will present several success stories of crowdsourcing linked data quality assessment, showing that it can be an affordable and viable solution when used in combination with existing approaches.
12:30 - 13:30   Lunch  
13:30 - 14:00   Quality management in PoolParty by Helmut Nagy, Semantic Web Company PoolParty had already implemented quality management based on qSKOS to evaluate the consistency of SKOS vocabularies.
In the course of the ALIGNED project, RDFUnit has replaced qSKOS to provide a more flexible and extensible framework for implementing quality management for SKOS vocabularies and, in the future, beyond that.
In addition, repair mechanisms have been implemented that allow issues to be fixed on import.
We will show what benefits this has created and what we plan to do next.
14:00 - 14:30   JURION quality assurance by Christian Dirschl, Wolters Kluwer Wolters Kluwer has successfully implemented ALIGNED technologies like RDFUnit in its operational systems. We will show what benefits this has created and what we plan to do next.
14:30 - 15:00   Customer needs for Data quality by Irene Polikoff, TopQuadrant TopBraid vocabulary management and data governance products (TopBraid EVN and TopBraid EDG) include a user-friendly way to define data quality rules in the web UI. The data quality rules used by our customers range from very simple and generic (e.g., defining mandatory fields) to fairly complex and domain specific (e.g., when the status of a loan is “funded”, the value of the funding-date must not be NULL and the value of the loan-amount must be greater than zero).
While our products directly manage the “golden copy” of controlled vocabularies and reference data, this information may also reside in other systems - databases, CMS, portals, etc. In cases of metadata management for master or transactional data, data will nearly always reside in external systems.
We will describe different use cases for data quality rules through customer examples. These range from running data checks directly on data our products manage, to providing services to check conformance of data residing elsewhere, to simply serving as a knowledgebase of rules that gets consulted for execution by other systems.
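As an illustration only, the loan rule quoted in the abstract above could be written as a small check along the following lines. This is a minimal sketch and not how TopBraid products define or execute rules: the `Loan` record, its field names and the `check_funded_loan` function are hypothetical names introduced here.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class Loan:
    # Hypothetical record layout, used only for this illustration.
    status: str
    funding_date: Optional[date]
    loan_amount: float


def check_funded_loan(loan: Loan) -> List[str]:
    """Return the violations of the funded-loan rule (empty if the loan conforms)."""
    violations: List[str] = []
    if loan.status != "funded":
        # The rule only constrains loans whose status is "funded".
        return violations
    if loan.funding_date is None:
        violations.append("funding-date must not be NULL")
    if not loan.loan_amount > 0:
        violations.append("loan-amount must be greater than zero")
    return violations
```

Returning a list of violations instead of a single boolean lets a caller report every problem with a record at once, which is closer to how validation reports are usually consumed.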
15:00 - 15:30   Coffee Break  
15:30 - 16:00   OOPS!: on-line ontology diagnosis by Maria Poveda, Universidad Politécnica de Madrid Ontology evaluation, which includes ontology diagnosis and repair, is a complex activity that should be carried out in every ontology development project. OOPS! (OntOlogy Pitfall Scanner!) is an on-line system that allows ontology engineers to (semi)automatically diagnose their ontologies. The system is based on a catalogue that describes 41 pitfalls that ontology developers might include in their ontologies. Currently, OOPS! implements 33 of the 41 pitfalls defined in the catalogue.
16:00 - 16:30   LOD Laundromat by Wouter Beek, VU University Amsterdam and Triply For the past 15 years the Semantic Web community has tried to improve data quality through standardization, formulation of best practices, improvement of tooling and education. The LOD Laundromat takes a radically different approach: it improves the quality of a copy of the entire LOD Cloud instantaneously, through automated means. This talk will focus on the interaction of these two data quality paradigms. Specifically, it will explore possibilities for improving the quality of the original datasets in the LOD Cloud based on improvements made by the LOD Laundromat.
16:30 - 17:00   Mappings Validation by Anastasia Dimou, Ghent University - iMinds Nowadays quality assessment is primarily performed after Linked Data is published. However, it is observed that the most frequent violations are related to the dataset’s schema, namely the way vocabularies or ontologies are applied to annotate the original data, the so-called mappings. The more combinations of different ontologies and vocabularies are used, the higher the likelihood that such violations appear. In this talk, I will present how quality assessment may be applied to mappings, improving Linked Data quality before the data is even generated, by showcasing how RDFUnit test cases were applied to mappings defined with RML.

Short CVs of speakers

Amrapali Zaveri

has been a postdoctoral researcher at Stanford University since September 2015. She completed her PhD at the University of Leipzig, Germany. Her research interests include data quality, knowledge interlinking and fusion, biomedical and health care research. As part of her research, she investigated the various aspects concerning data quality with special emphasis on Linked Data. In the process, she conducted a comprehensive survey of the existing data quality assessment methodologies currently available to evaluate the quality of linked datasets. She has also been working on crowdsourcing methodologies for the assessment and improvement of Linked Data quality. She was a guest co-editor of the special issue on Linked Data Quality in the International Journal on Semantic Web and Information Systems in 2014. She was a co-organizer of the 2nd and 3rd Workshops on Linked Data Quality. She is currently a guest co-editor of the special issue on Quality Management of Semantic Web Assets (Data, Service and Systems) in the Semantic Web Journal. She is the co-chair of the Research and Innovation track of SEMANTiCS 2016.

Dr. Gavin Mendel-Gleason

is a Research Fellow at Trinity College focusing on logic, constraint checking, validation, ontologies and database design. He is also interested in applications of type theory to ontology validation. He is working on the ALIGNED European H2020 project.

Dr. Bojan Božić

is a Research Fellow at Trinity College Dublin. He is currently working on the European H2020 project ALIGNED. His expertise and areas of interest span the fields of Ontology Development and Validation, Reasoning, and Metadata Quality.

Irene Polikoff

Irene Polikoff has more than two decades of experience in software development, management, consulting and strategic planning. Since co-founding TopQuadrant in 2001 Irene has been involved in more than a dozen projects in government and commercial sectors. She has written strategy papers, trained customers on the use of the Semantic Web standards, developed ontology models, designed solution architectures and defined deployment processes and guidance.

Dimitris Kontokostas

has been a PhD candidate at the AKSW group of Leipzig University since September 2012. Dimitris’ core research interests are Linked Data quality and the application of quality techniques to the library domain and DBpedia. He is one of the lead developers of the DBpedia Extraction Framework and has developed several tools for Linked Data validation. He is one of the editors of the W3C SHACL specification, one of the initiators of the Workshop on Linked Data Quality (LDQ) series and a co-organizer of the 1st Workshop on NLP&DBpedia (ISWC 2013). He is currently a guest co-editor of the special issue on Quality Management of Semantic Web Assets (Data, Service and Systems) in the Semantic Web Journal.

Helmut Nagy

is COO of the Semantic Web Company and responsible for coordinating the development, documentation and quality assurance of the PoolParty product family, as well as customer support. Additionally, he is involved as a senior consultant in customer projects (industry and public administration) and participates in research projects like LOD2 and ALIGNED. Helmut has worked with SWC since 2010. Before that, he worked in the field of technical documentation and on the use of social software, especially wikis, for improving communication and collaboration in companies. Helmut has a Master of Arts degree in Media Studies & Communication Science and German Philology from the University of Vienna. He is based in Vienna, Austria.

Christian Dirschl

is Chief Content Architect and head of Content Strategy and Architecture at Wolters Kluwer Germany. He is responsible for the content structures, metadata, taxonomies, and thesauri within Wolters Kluwer Germany. He manages text mining and automatic topical classification projects. He also represents Wolters Kluwer Germany in international research projects like LOD2, ALIGNED or WDAqua. Christian has worked with Wolters Kluwer Germany since 2001. Before that, he worked as an international IT consultant in several software companies. Christian has a Master of Arts degree in Information Science from the University of Regensburg. He is based in Munich, Germany.

Maria Poveda Villalón

is a research fellow at the Ontology Engineering Group at Universidad Politécnica de Madrid. She received her PhD in Artificial Intelligence from the Universidad Politécnica de Madrid in 2016. Her research activities are focused on Ontological Engineering and the Semantic Web. More specifically, she is interested in the areas of Knowledge Modelling (including conceptualization, formalization and implementation), Ontology Evaluation, and Ontology Design Patterns. She is also currently working on Linked Data modelling, generation and exploitation. As part of her PhD thesis she developed OOPS! (OntOlogy Pitfall Scanner!), an online system for evaluating ontologies that has been widely adopted by users worldwide and has been used from 60 different countries. OOPS! is integrated with third-party software and is locally installed in private enterprises, where it is used both for ontology development activities and for training courses.

Wouter Beek

received his Master's in Logic from the Institute for Logic, Language and Computation (ILLC). He is currently a PhD student at VU University Amsterdam (VUA), working in the Knowledge Representation & Reasoning (KR&R) group. His research focuses on the development, deployment and analysis of large-scale heterogeneous knowledge bases and the way in which they enable unanticipated and innovative reuse. Wouter is the principal developer of the LOD Laundromat and LOD Lab. He is also co-founder of

Anastasia Dimou

has been a Scientific Researcher at Ghent University - iMinds since February 2013. Her research interests include Linked Data Generation and Publication, Data Quality and Integration, Knowledge Representation and Management. As part of her research, she investigated a uniform language for describing the mapping rules for generating high quality Linked Data from multiple heterogeneous data formats and access interfaces. Anastasia is also working on Linked Data generation and publishing workflows.