IK Open Data

Introduction

Motivation

Base-year and historical data for key transport quantities differ across global transport energy models. These differences impede understanding of the state of, and fundamental trends in, the global transport system. They also hinder the generation of useful, model-based knowledge about future challenges and policy options for transport system transitions. The data differ because of differences in:

  • the concepts measured,
  • sources of derived and pre-processed data, and
  • processes for cleaning and harmonization.

Goals

The iTEM-KAPSARC (IK) Open Data project aims to meet the call from iTEM3 (2017) for a common, public, “best available” database for baseline calibration of models, provided through a transparent, scientific process. Its goals are to:

  1. Collate publicly available historical transport data.
  2. Make these data easily accessible to modellers through open-source software.
  3. Develop methods for transforming existing data and transparently handling data quality.
  4. Publish software implementing these methods, to allow transport researchers to customize steps, assumptions, etc. in data processing.
  5. Continually increase scope and resolution in multiple dimensions:
    • Spatial: from regions to individual countries and sub-national regions.
    • Conceptual: to more disaggregated transport modes, vehicle types, and fuels.
    • Temporal: from the recent past back into the 20th century.

Data process & concepts

The project is organized around a target process and terminology of data:

[Figure: iTEM-KAPSARC Open Data diagram]

Data

Collections of observations of specific measures for general concepts, organized in one or more dimensions, with attributes:

  • Concept: both background concepts and specific, systematic, defined meanings. Example: ‘energy demand’ and ‘fuel use’.
  • Measure: an operational definition, including units, of a systematic concept. Multiple measures may exist for the same concept. Example: ‘fuel use’ may be measured in terms of the volume of fuel (litres) or its energy content (joules).
  • Observation: a single value for a measure.
  • Dimension: a named list of labels or values used to organize multiple observations in a set of data. Example: ‘year’ (a sequential list of annual periods), ‘country’ (names or codes for countries).
  • Attribute: any information associated with an observation or group of observations. Example: the attribute ‘status’ might have a value of “Provisional” or “Final”, related to a statistical agency's process of publishing preliminary and then final values.
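
The terms above can be sketched as a small data model. This is only an illustration, not the project's actual software: the class names `Measure` and `Observation`, the country code "XX", and the numeric value are hypothetical placeholders, not real data.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Measure:
    """An operational definition, with units, of a systematic concept."""
    concept: str  # e.g. 'fuel use'
    name: str     # what exactly is measured
    unit: str     # units of the measure


@dataclass
class Observation:
    """A single value for a measure, located by dimensions."""
    measure: Measure
    value: float
    dimensions: dict                                # dimension name -> label
    attributes: dict = field(default_factory=dict)  # e.g. {'status': 'Final'}


# Two measures of the same concept, 'fuel use'
fuel_volume = Measure(concept="fuel use", name="volume of fuel", unit="litre")
fuel_energy = Measure(concept="fuel use", name="energy content of fuel", unit="J")

# One observation; the value is a placeholder, not real data
obs = Observation(
    measure=fuel_volume,
    value=1.5e9,
    dimensions={"year": 2015, "country": "XX"},
    attributes={"status": "Provisional"},
)
```

Keeping the measure separate from the concept makes it explicit when two data sets describe the same concept in different units.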
Raw data

Primary sources of data. These may include:

  • Country sources: Data published by national organizations, such as national statistical agencies and ministries of transport or energy, which directly measure quantities or collect measurements from subsidiary organizations.
  • Existing aggregations: Data collected and assembled into larger data sets by various organizations. These may include data from multiple upstream sources (such as country sources), with or without any cleaning, adjustment, or harmonization.
  • Modelled raw data: Data that are produced from an existing model or calculations.
Assumptions

Quantities used in data processing.

  • Conversion factors: used to convert between alternative measures of the same concept. Example: energy content of fuel is used to convert ‘fuel use’ from volume to energy units.
  • Parameters: used to derive one measure from another in a data processing calculation. Example: ‘occupancy of passenger vehicles’ (persons per vehicle) is used to calculate ‘passenger travel’ (passenger-kilometres) from ‘vehicle travel’ (vehicle-kilometres).
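
The two kinds of assumption can be illustrated with a short sketch. The function names are hypothetical, and the numbers (roughly 32 MJ per litre for gasoline, 1.5 persons per vehicle) are assumed here purely for illustration; they are not values adopted by the project.

```python
# Assumed, illustrative values; not values adopted by the project
ENERGY_CONTENT_MJ_PER_LITRE = 32.0  # conversion factor (approximate, gasoline)
OCCUPANCY = 1.5                     # parameter: persons per vehicle


def fuel_volume_to_energy(litres, mj_per_litre):
    """Convert 'fuel use' from volume (litres) to energy (joules)."""
    return litres * mj_per_litre * 1e6  # 1 MJ = 1e6 J


def passenger_travel(vehicle_km, occupancy):
    """Derive passenger travel (passenger-km) from vehicle travel (vehicle-km)."""
    return vehicle_km * occupancy


energy_j = fuel_volume_to_energy(100.0, ENERGY_CONTENT_MJ_PER_LITRE)
pkm = passenger_travel(1000.0, OCCUPANCY)
```

The first function changes the measure of a single concept; the second derives a measure of a different concept, which is why the text distinguishes conversion factors from parameters.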
Data processing
Algorithms that combine raw data and assumptions to produce a dataset with greater coverage or quality; or to derive certain measures from raw data.
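
As one example of an algorithm that increases coverage, the sketch below fills gaps in an annual series by linear interpolation. The choice of method and the function name `fill_gaps` are assumptions made for illustration; the source does not specify which algorithms the project uses.

```python
def fill_gaps(series):
    """Linearly interpolate missing (None) values between known years.

    `series` maps year -> value or None. Gaps before the first or after
    the last known value are left unfilled (no extrapolation).
    """
    years = sorted(series)
    known = [y for y in years if series[y] is not None]
    filled = dict(series)
    for y in years:
        if series[y] is not None:
            continue
        before = [k for k in known if k < y]
        after = [k for k in known if k > y]
        if not before or not after:
            continue  # cannot interpolate at the edges
        y0, y1 = before[-1], after[0]
        weight = (y - y0) / (y1 - y0)
        filled[y] = series[y0] + weight * (series[y1] - series[y0])
    return filled


# Placeholder values, not real data
result = fill_gaps({2010: 100.0, 2011: None, 2012: 120.0, 2013: None})
```

Here the value for 2011 is interpolated from its neighbours, while 2013 stays missing because no later observation exists.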
Ingestion
Operations that begin with raw or derived data and change its format to be suitable for use in other software or models.
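
A minimal sketch of an ingestion step, assuming the target format is CSV: it uses only the Python standard library, and the function name `to_csv`, the column names, and the record values are hypothetical.

```python
import csv
import io


def to_csv(records, columns):
    """Serialize a list of dict records to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns, lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


# Placeholder record, not real data
text = to_csv(
    [{"country": "XX", "year": 2015, "value": 1.5}],
    ["country", "year", "value"],
)
```

The content of the data is unchanged; only its representation is adapted for the consuming software or model.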

Contributing

Core contributors are:

iTEM participants

…in particular the iTEM organizing group and affiliated researchers.

  • Identify public data sets.
  • Develop a software pipeline for data processing.
  • Use the processed database for modeling and research.
KAPSARC

(the King Abdullah Petroleum Studies and Research Center):

  • Host raw and processed data via the KAPSARC Data Portal. (21 on-road datasets hosted as of April 2019)
  • Develop and maintain APIs for retrieval and storage of data.
  • Ingest raw data from public sources, including automatic updating as primary sources are updated.

We also welcome contributions of many kinds, including:

  • Information about existing public data sets; in particular, country-specific sources.
  • Existing codes and procedures for correcting known issues in particular data sets.
  • Opinions or academic literature on best practices for handling data for specific transport concepts.
  • Software development in Python and R.

Because the space of all transport data is large, the interests of iTEM and the broader transport research community will set priorities for improving coverage and quality.