Data and Datasets overview

This note provides some background on various notions of "data" and "dataset" related to Schema.org.

Schema.org as a project, and as a collection of terms, is entirely devoted to data. In other words, it always provides, characterises, describes, or encodes some form of data. Schema.org defines particular types such as Event, NewsArticle, Review, Person, as well as properties that characterize and interlink instances of these types. For example, the alumni property links people with educational organizations. The alumni property exists to provide information about people being alumni of organizations; Volcano exists to provide information about volcanoes, and so on. However, there can sometimes be confusion when the thing we are providing information about is itself thought of as data (typically a bundle of it).

Schema.org itself also contains some dedicated vocabulary that can be used in applications that publish, discover or integrate different kinds of data. Just as schema.org defines vocabulary to help describe people, volcanoes and public toilets, it can also be used to describe data. This capability is in addition to schema.org's general nature as a collection of structured data schemas, and complements numerous other data-related formats and standards.

In particular, schema.org defines vocabulary for providing Dataset metadata, alongside (proposed) vocabulary for describing aggregate statistics.

To take a specific example, the Volcano type in schema.org is useful for volcano data, but in a different way from a Dataset type being used to describe a collection of data about volcanoes (e.g. in CSV or XML format). Similarly, the Population / Observation types can be used to represent aggregate statistics of "populations" of volcanoes. While http://schema.org/Volcano can be used to directly provide information about specific volcanoes, the http://schema.org/Dataset and http://schema.org/Observation types emphasise the data level of abstraction more directly.
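The contrast between the two levels of description can be sketched as a pair of JSON-LD documents, built here as Python dictionaries. This is only an illustration: the dataset name, description and example.org URL are hypothetical, and the coordinates are approximate. The types and properties used (Volcano, GeoCoordinates, Dataset, DataDownload, geo, distribution, encodingFormat, contentUrl) are existing Schema.org terms.

```python
import json

# Entity level: a statement about one specific volcano.
volcano = {
    "@context": "https://schema.org",
    "@type": "Volcano",
    "name": "Mount Etna",
    # Approximate coordinates, for illustration only.
    "geo": {"@type": "GeoCoordinates", "latitude": 37.75, "longitude": 14.99},
}

# Data level: a statement about a collection of volcano data.
# The name, description and URL below are hypothetical.
dataset = {
    "@context": "https://schema.org",
    "@type": "Dataset",
    "name": "Example volcano catalogue",
    "description": "A hypothetical table of volcanoes, one row per volcano.",
    "distribution": {
        "@type": "DataDownload",
        "encodingFormat": "text/csv",
        "contentUrl": "https://example.org/volcanoes.csv",
    },
}

# Serialise both descriptions as JSON-LD text.
volcano_jsonld = json.dumps(volcano, indent=2)
dataset_jsonld = json.dumps(dataset, indent=2)
print(volcano_jsonld)
print(dataset_jsonld)
```

The first document says something about a volcano; the second says nothing about any individual volcano, and instead describes where a file of volcano data can be found and what format it is in.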

Other related work includes W3C's CSVW and RDF Data Cube specifications, as well as the DSPL 2.0 specification. DSPL 2.0 combines Schema.org for per-dataset metadata with the use of CSV files to represent code lists, enumerations and statistical observations. DSPL 2.0 provides an explicit high-fidelity representation of datasets in their own terms, rather than mapping everything into Schema.org.

These technologies all depend in turn on lower-level standards, such as JSON-LD, RDFa, Microdata, XML and Unicode, and share a broadly RDF-like approach to representing information. There are also related standards from W3C and elsewhere dedicated to lifting factual data out of various kinds of dataset into RDF statements that use vocabularies such as Schema.org. For example, see R2RML, which addresses this for SQL; GRDDL, for XML via XSLT; the CSVW to RDF mappings, for static tabular data; and JSON-LD's context mechanism, for certain forms of JSON data.
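The idea of lifting plain data into RDF-like statements can be illustrated with a toy version of a context mapping. The sketch below is not a JSON-LD processor and does not follow the JSON-LD algorithms; the function lift, the record keys and the example.org subject IRI are all hypothetical. It only shows the general principle: keys that the context maps to vocabulary IRIs become predicates, and keys with no mapping are left out.

```python
# Hypothetical mapping from local JSON keys to Schema.org property IRIs.
context = {
    "title": "http://schema.org/name",
    "homepage": "http://schema.org/url",
}

# A plain JSON-style record that was not written with Schema.org in mind.
record = {
    "title": "Mount Etna",
    "homepage": "https://example.org/etna",
    "internal_id": 17,  # no mapping in the context, so it is not lifted
}


def lift(record, context, subject):
    """Return (subject, predicate, object) triples for the mapped keys."""
    triples = []
    for key, value in record.items():
        predicate = context.get(key)
        if predicate is None:
            continue  # unmapped keys produce no statement
        triples.append((subject, predicate, value))
    return triples


# The subject IRI is an illustrative assumption.
triples = lift(record, context, "https://example.org/volcano/etna")
for triple in triples:
    print(triple)
```

A real deployment would use a conforming JSON-LD implementation rather than a hand-written loop; the point here is only that the mapping lives alongside the data rather than inside it.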