Datasets and data services
What are “datasets” and “data services” in the RING?
In the RING, the two main sections for searching information sources are “Datasets” and “All data services”. Under “All data services” you will find all the services that have been registered in the RING, while under “Datasets” you will find only the data services that make at least one collection of data available for access or download in one or more formats.
The following definitions explain this distinction better.
“Data services”
“Data service” is a generic term for any type of data service on the web, from a simple website to a search engine, an application programming interface (API) or a data dump.
A more technical definition of what is considered a “data service” in the RING is: any platform that provides data services from one server instance (website, mail server, web services endpoint, XML archive) to any client (browsers, email clients, news readers, special protocol clients).
Any service that is registered in the RING will be listed in the “All data services” section.
“Datasets”
Dataset is a more specific term that has been defined in several ways[1], all of which further specify or extend the basic concept of “a collection of data”.
The RING follows the definition of “dataset” given by the W3C Government Linked Data Working Group: a dataset is “a collection of data, published or curated by a single source, and available for access or download in one or more formats”. According to the same definition, the “instances” of the dataset “available for access or download in one or more formats” are called “distributions”: a distribution is “a specific available form of a dataset. Each dataset might be available in different forms, these forms might represent different formats of the dataset or different endpoints. Examples of distributions include a downloadable CSV file, an API or an RSS feed”.
Therefore, datasets in the RING are a subset of the more generic data services and comprise only the services that make a collection of data available for machine-access or download in one or more formats (distributions). The word “access” here has a specific technical meaning indicating machine-access at a certain address through a certain protocol, not just access through a web user interface (therefore, an online catalog search is not a dataset). In the same way “in one or more formats” here means in one or more machine-processable formats (therefore, a downloadable Word or PDF file with a list of bibliographic citations is not a dataset).
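The two conditions above (machine access and a machine-processable format) can be pictured as a simple check. The Python sketch below is only an illustration: the function name, the access labels and the list of formats are assumptions made for this example and are not an official RING vocabulary.

```python
# Illustrative only: these labels are assumptions for the example,
# not an official RING vocabulary.
MACHINE_FORMATS = {"csv", "xml", "rdf", "json", "netcdf", "rss"}
MACHINE_ACCESS = {"download", "api", "endpoint"}


def qualifies_as_dataset(access: str, data_format: str) -> bool:
    """Return True when both the access mode and the format are machine-oriented."""
    return (
        access.lower() in MACHINE_ACCESS
        and data_format.lower() in MACHINE_FORMATS
    )


# A CSV file downloadable at a URL.
print(qualifies_as_dataset("download", "CSV"))
# A downloadable PDF file with a list of bibliographic citations.
print(qualifies_as_dataset("download", "PDF"))
# An online catalog search used through a web page.
print(qualifies_as_dataset("web-ui", "html"))
```

Only the first case satisfies both conditions; the PDF fails on format and the catalog search fails on access.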
A special note is needed regarding dynamic dataset endpoints. Many data services that give direct access to the data in machine-readable format do not just expose static datasets for individual download, but provide an endpoint (REST API, web service, SPARQL or OAI-PMH endpoint) to query the data and retrieve a dynamically created dataset.
In these cases, the description of the access point should ideally provide the technical details of the protocol, parameters and response format in order to allow for the direct retrieval of data.
Important: the endpoint has to be an endpoint for machines, which can be called directly by an application and returns a dataset, not a web search engine where humans search and download data.
The RING considers these services as datasets because calling the endpoint URL with the appropriate parameters returns a dataset, although the result is not necessarily the whole data collection and is most probably a subset.
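As an illustration of “calling the endpoint URL with the appropriate parameters”, the following Python sketch builds an OAI-PMH request. The base URL is hypothetical and the helper function is an assumption made for this example; `ListRecords` and `metadataPrefix=oai_dc` are a standard OAI-PMH verb and argument.

```python
from urllib.parse import urlencode

# Hypothetical endpoint address, used only for illustration.
BASE_URL = "https://example.org/oai"


def build_oai_request(base_url: str, verb: str, **arguments: str) -> str:
    """Build the URL of an OAI-PMH request from a verb and its arguments."""
    query = urlencode({"verb": verb, **arguments})
    return f"{base_url}?{query}"


# Ask the endpoint for its records in Dublin Core format.
url = build_oai_request(BASE_URL, "ListRecords", metadataPrefix="oai_dc")
print(url)
```

An application would then fetch this URL and parse the XML response. Adding further arguments to the call (for example a set name) narrows the response to a subset of the collection.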
Examples
A website that has a search engine where the user can interactively search and browse and even download a collection of data is not a dataset, while the following can all be considered datasets in the RING:
Full datasets:
- an RSS feed reachable at a URL;
- an XML dump downloadable via FTP or reachable at a URL;
- a CSV or NetCDF file available at a URL;
- an API call already parametrized to retrieve a specific dataset.
Dynamic dataset endpoints:
- a SPARQL engine that responds to a query with an RDF response that represents a dataset;
- an OAI-PMH target that responds to a verb call with an XML response;
- any web service or API endpoint whose response is a dataset.
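What these examples have in common is that an application can read them directly, without a person operating a web page. The Python sketch below illustrates this with an RSS feed. The feed content is a made-up sample embedded in the code so that the example does not depend on any real address, and the function name is an assumption made for this illustration.

```python
import xml.etree.ElementTree as ET

# Made-up sample standing in for the content served at a feed URL.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Sample feed</title>
    <item><title>First record</title><link>https://example.org/1</link></item>
    <item><title>Second record</title><link>https://example.org/2</link></item>
  </channel>
</rss>"""


def read_items(feed_xml: str) -> list[dict[str, str]]:
    """Extract the title and link of every item in an RSS document."""
    root = ET.fromstring(feed_xml)
    return [
        {
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
        }
        for item in root.iter("item")
    ]


items = read_items(SAMPLE_FEED)
for entry in items:
    print(entry["title"], entry["link"])
```

The same records shown only on an interactive search page would have to be copied by hand or scraped, which is why such a page is not considered a dataset.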
Registering datasets
The template for registering a dataset and a data service is the same: the only difference is that when registering a dataset the “Access” metadata are mandatory. Users who want to register a dataset must start registering a “Dataset / data service” and then, under the “Access to data” tab, fill in the information about the available distributions of the dataset: only records that have the “Access to data” tab filled in are included in the Dataset search.
Following the above definitions, any information service that is registered in the RING is listed among the generic “data services”, while only those services for which at least one “distribution of data” is available for access or download in one or more formats are listed among the datasets.
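The listing rule can be sketched as a filter over registered records. The Python example below is a simplified illustration: the class names, field names and sample records are assumptions made for this example and do not reflect the actual RING data model.

```python
from dataclasses import dataclass, field


@dataclass
class Distribution:
    """One available form of a dataset (hypothetical fields)."""
    access_url: str
    data_format: str


@dataclass
class ServiceRecord:
    """A registered service; distributions come from the 'Access to data' tab."""
    title: str
    distributions: list[Distribution] = field(default_factory=list)


def list_datasets(records: list[ServiceRecord]) -> list[ServiceRecord]:
    """Keep only the records that have at least one distribution."""
    return [record for record in records if record.distributions]


# Made-up sample records.
records = [
    ServiceRecord("Institutional website"),
    ServiceRecord(
        "Publications feed",
        [Distribution("https://example.org/feed.rss", "RSS")],
    ),
]

print([record.title for record in records])                 # all data services
print([record.title for record in list_datasets(records)])  # datasets only
```

Every record appears in the first list, while only the record with a distribution appears in the second.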
For the purposes of data sharing and re-use, and of building better information and data services, registering a service with at least one real accessible “dataset” goes much further than registering just a website or an interactive search engine: data in a dataset are re-usable, data behind a search engine are not.
[1] Wikipedia: http://en.wikipedia.org/wiki/Data_set
W3C Government Linked Data Working Group: http://www.w3.org/TR/vocab-dcat/#class--dataset
DISC, Data Information Specialists Committee – UK: http://www.disc-uk.org/qanda.html
Download the RING Handbook below to learn more about the registration procedure.