An R package to search and retrieve data from Eurostat database using SDMX
‘restatapi’ can be installed from CRAN by
install.packages("restatapi")
or use the development version from GitHub
::install_github("eurostat/restatapi") remotes
This package is similar to other packages like the eurodata, eurostat, rdbnomics, RJSDMX or TSsdmx which can be
used to download data from Eurostat database. The difference is that
restatapi
is based on SDMX (Statistical Data and Metadata
eXchange) and XML to search and retrieve filtered datasets and use the
TSV (tab separated values) bulk download facility to get whole data
tables. The code was written in a way that the number of dependencies on
other packages should be very small. The restatapi
package
provides flexible filtering options, data caching, and uses the
parallel
and data.table
package to handle
large dataset in an efficient way.
The package contains 8 main functions and several other sub functions in 3 areas.
get_eurostat_toc
function downloads the Table of
Contents (TOC) of all Eurostat
datasets.search_eurostat_toc
function provides the facility
to search for phrase, pattern and regular expressions in the TOC and
returns the rows of the TOC where the search string(s) found.get_eurostat_dsd
function returns the Data
Structure Definition (DSD) of a given dataset containing the possible
dimensions and values with their labels.search_eurostat_dsd
function provides the facility
to search for phrase, pattern and regular expressions in the DSD and
returns the rows of the DSD where the search string(s) found.create_filter_table
function creates a filter table
based on the DSD and the search expressions which can be applied on the
local computer to filter out data from whole data tables.get_eurostat_raw
function downloads the full data
table as it is either using the TSV format (default) or the SDMX format
keeping all the column names and rows as it is in the original
files.get_eurostat_bulk
function downloads the full data
set keeping only a unique frequency with standardized column names and
removing those columns which do not contain additional information, like
frequency and time format.get_eurostat_data
function retrieves a data table
which can be labeled using the labels from the DSD. The table can
contain the whole datasets or only part of it if filters are applied. If
after the filtering the number of observations is to large, then the
whole dataset is downloaded and the filter applied on the local
computer. If no filter used, it is equivalent with the
get_eurostat_bulk
function, but in this case labels can be
applied.Below there are examples demonstrating the main features, the detailed documentation of the functions is in the package.
Next to the functions the package contains a list of country codes for different groups of European countries based on the Eurostat standard code list, e.g.: European Union (EU28, …, EU6), Euro Area (EA19, …, EA11) or New Member States (NMS13, …, NMS2).
Example 1: First set the number of cores/threads
(restatapi_cores
) to 3 and download the XML version of the
the English Table of Contents (TOC). Then change the file download
method (dmethod
) from the default auto
to
libcurl
in case there is problem with the system defaults,
and download the txt version (mode="txt"
) of the TOC
showing the detailed debugging messages (verbose=TRUE
). The
debugging information can show if there is a problem with the internet
connection, as it provides the URL which is used to download the data
from the API. The provided URL can be copied and checked in a regular
browser if the API gives a response or not. Finally, search not case
sensitive (ignore.case=TRUE
) for the word
energie
in the German version (lang="de"
) of
the TOC.
options(restatapi_cores=3)
get_eurostat_toc()
options(restatapi_dmethod="libcurl")
get_eurostat_toc(mode="txt",verbose=TRUE)
search_eurostat_toc("energie",lang="de",ignore.case=TRUE)
Example 2: Download the Data Structure Definition
(DSD) for the ‘Financial services - quarterly data’
(ei_bsfs_q
) datasets. First search in the DSD the “EU”
pattern everywhere case sensitive. Then search only in the code list
(name=FALSE
).
<-get_eurostat_dsd("ei_bsfs_q")
dsdsearch_eurostat_dsd("EU",dsd)
search_eurostat_dsd("EU",dsd,name=FALSE)
Example 3: Download the raw dataset
avia_gor_me
(‘Freight and mail air transport between the
main airports of Montenegro and their main partner airports (routes
data)’) from the Tab Separated Value (tsv) file and keep it as it is
(melt=FALSE
). In his case each time period will be in
separate columns. Then download the same whole dataset using the SDMX
(xml
) version. In this case the data will be melted.
Finally using the get_eurostat_bulk
function to download
the same dataset. In this case the data will contain only a unique time
period (the most frequent one - monthly data) with standardized column
names (‘time’, ‘values’ and ‘flags’ in case keep_flags=TRUE
used).
get_eurostat_raw("avia_gor_me",melt=FALSE)
get_eurostat_raw(" Avia_gor_ME","xml")
get_eurostat_bulk("AVIA_GOR_ME ")
Example 4: Download the whole data table of ‘GDP and
main components’ (nama_10_gdp
), then check if the provided
id
is in the Table of Contents
(check_toc=TRUE
). If it is correct, then do not use the
cached version from the memory (.restatapi_env
), but rather
download it again and update the dataset in the cache
(update_cache=TRUE
). Finally, change the cache directory
from memory to hard disk to a temporary folder
(cache_dir="/tmp"
, /tmp
is a typical temporary
folder on Unix-like systems) and download there the whole data table
keeping all non-numeric values as string instead of converting to
factors (stringAsFactors=FALSE
), and keeping the lines
without values which were suppressed due to confidentiality (having the
‘c’ flag, cflags=TRUE
).
get_eurostat_data("NAMA_10_GDP")
get_eurostat_data("nama_10_gdp",update_cache=TRUE,check_toc=TRUE)
get_eurostat_data("nama_10_gdp",cache_dir="/tmp",stringAsFactors=FALSE,cflags=TRUE)
Example 5: Set the option, that always download the
dataset from the internet and do not use the cached version of the
data(restatapi_update=TRUE
). In this case it is not
necessary define this option for each data query. Then change the
default cache from memory (.restatapi_env
) to hard disk
(restatapi_cache_dir=file.path(tempdir(),"restatapi")
), to
the folder restatapi
inside the R temporary folder. This is
the default cache directory on the hard disk, which will be created, in
case the provided cache_dir
folder does not exist.
options(restatapi_update=TRUE)
options(restatapi_cache_dir=file.path(tempdir(),"restatapi"))
Example 6: First
download the annual (select_freq="A"
) air passenger
transport data for the main airports of Montenegro
(avia_par_me
) and do not cache any of the data
(cache=FALSE
). Then from the same table download the
monthly (select_freq="M"
) and quarterly
(filters="Q...
) data for 2 specific airport pairs/routes
(filters=...ME_LYPG_HU_LHBP+ME_LYTV_UA_UKKK"
) in August
2016 and on 1 July 2017
(date_filter=c("2016-08","2017-07-01")
). The filters are
provided in the format how it is required by the REST
SDMX web service. Under the old API, it returned the value for the
selected routes for the month August 2016, July 2017 and the 3rd quarter
of 2017. Meanwhile under the new API, it
returns all the quarterly and monthly value, as there is a single day in
the date_filter
. Then download again the monthly and
quarterly data (filters=c("Quarterly","Monthly")
) where
there is exact match in the DSD for “HU” for August 2016 and 1 March
2014 (date_filter=c("2016-08","2014-03-01")
). This query
will provide only monthly data for 2016, as the quarterly data is always
assigned to the first month of the quarter and there is no data for
2014. Since there is no exact match for the “HU” pattern, it returned
all the monthly data for August 2016 and put the labels (like the name
of the airports and units) so the data can be easier understood
(label=TRUE
) under the old API. Under the
current API, it returns all the quarterly and
monthly data as there is a single day in the date_filter
.
Finally, download only the quarterly data (select_freq="Q"
)
for several time periods
(date_filter=c("2017-03",2016,"2017-07-01",2012:2014)
, the
order of the dates does not matter) where the “HU” pattern can be found
anywhere, but only in the code
column of the DSD
(filters="HU",exact_match=FALSE,name=FALSE
). The result was
all the statistics about flights from Montenegro to Hungary in the 3rd
quarter of 2017, as there were no information for the other time periods
under the old API. Under the current API, it
gives back all the quarterly data in the dataset for flights from
Montenegro to Hungary because in the date_filter
there is a
single day. Before 2022, in the old dissemination chain the value was
assigned to the first day of the month, quarter and year, so it
was enough to filter for one day to get the value for the whole period.
Under the current API the value belongs to the full period. If a date
range does not cover the whole period no value is returned. For example,
to get the value of the whole quarter the date filter should start at
least on the first date of the quarter and end at least on the last day
of the quarter. With exact numerical example to get the value for
2022/Q3, the startDate
should be 2022-07-01 or earlier and
the endDate
should be 2022-09-30 or later. In the old
version of the API it was enough if the period included the day
2022-07-01 only.
<-get_eurostat_data("avia_par_me",select_freq="A",cache=FALSE)
dt<-get_eurostat_data("avia_par_me",
dtfilters="Q...ME_LYPG_HU_LHBP+ME_LYTV_UA_UKKK",
date_filter=c("2016-08","2017-07-01"),
select_freq="M")
<-get_eurostat_data("avia_par_me",
dtfilters=c("HU","Quarterly","Monthly"),
date_filter=c("2016-08","2014-03-01"),
label=TRUE)
<-get_eurostat_data("avia_par_me",
dtfilters="HU",
exact_match=FALSE,
date_filter=c("2017-03",2016,"2017-07-01",2012:2014),
select_freq="Q",
label=TRUE,
verbose=TRUE,
name=FALSE)
Example 7: Download from the Time Use Survey (TUS)
data (tus_00age
) for 2010 (date_filter=2010
)
in Hungary how much time spent with travel on average. If someone does
not know the exact codes then the filter patterns
(filters=c("total","time spent","HU",'travel')
) can be
searched in the labels (label=TRUE
) non case sensitive
(ignore.case=TRUE
) without forcing exact match of the
patterns (exact_match=FALSE
). The first function call will
result an empty data table as the SDMX webservice will provide NaN (Not
a Number) response for the time values (hh:mm). But these values are
included in the bulk download files, and can be retrieved by forcing the
filtering on the local computer (force_local_filter=TRUE
).
Then the retrieved values can be summed using the chron package.
<-get_eurostat_data("tus_00age",
dtfilters=c("HU","total","time spent",'travel'),
date_filter=2010,
exact_match=FALSE,
ignore.case=TRUE,
label=TRUE)
<-get_eurostat_data("tus_00age",
dtfilters=c("HU","total","time spent",'travel'),
date_filter=2010,
force_local_filter=TRUE,
exact_match=FALSE,
ignore.case=TRUE,
label=TRUE)
dt!="Total",sum(chron::times(paste0(values,":00")))] dt[acl00
Example 8: Download the data on the production of
cow’s milk on farms by NUTS 2 regions (agr_r_milkpr
) first
only for the new Member States joined in 2004 and keeping only the
period between March 2009 and 5 June 2011
(date_filter="2009-03:2011-06-05"
). The country code of the
member states can be loaded from the .restatapi env
(eu<-get("cc",envir=.restatapi_env)
) and it can be used
in the query (filters=eu$NMS10
). Then get all the data for
Belgium for all NUTS
level from the same data set before July 2009
(date_filter="<2009-07"
) with the labels
(label=TRUE
) and the observation status (so called
flags) information (flags=TRUE
). In this case for
filter (filters="BE"
) the exact matching of pattern should
be turned off (exact_match=FALSE
) to get not just at
country (NUTS0) level. Finally, get the data at for Hungary at NUTS2
level after 19 May 2017 (date_filter="2017-05-19<
) and
keeping the lines which were removed due to confidentiality
(cflags=TRUE
). For this we do not have to know the exact
code, name or number of the NUTS2 regions as we can provide regular
expression in the filter (filters=c("^HU..")
) and providing
the option that the expression is Perl compatible
(perl=TRUE
).
<-get("cc",envir=.restatapi_env)
eu<-get_eurostat_data("agr_r_milkpr",
dtfilters=eu$NMS10,
date_filter="2009-03:2011-06-05")
<-get_eurostat_data("agr_r_milkpr",
dtfilters="BE",
date_filter="<2009-07",
keep_flags=TRUE,
exact_match=FALSE,
label=TRUE)
<-get_eurostat_data("agr_r_milkpr",
dtfilters=c("^HU.."),
date_filter="2017-05-19<",
cflags=TRUE,
perl=TRUE)
Example 9: Download the balance
(stk_flow="BAL"
) from the international trade in services
dataset (bop_its6_det
) for 2020
(date_filter=2020
), transport services
(bop_item="SC"
), with reporting countries Hungary and the
EU (geo=c("EU27_2020","HU")
) and trading partner outside EU
(partner="EXT_EU27_2020"
). In order to avoid that Hungary
shows up in the partner countries the filter should be defined as a
named list
(filters=list(bop_item="SC",partner="EXT_EU27_2020",geo=c("EU27_2020","HU"),stk_flow="BAL")
)
and do not search for the terms in the labels (name=FALSE
).
In this case the filter patterns only searched where the concept equals
to the name.
<-get_eurostat_data("bop_its6_det",
dtfilters=list(bop_item="SC",
partner="EXT_EU27_2020",
geo=c("EU27_2020","HU"),
stk_flow="BAL"),
date_filter=2020,
label=TRUE,
name=FALSE)
Example 10: After finishing the tasks the cache (in memory and on the hard drive) can be cleaned up.
clean_restatapi_cache(tempdir(),verbose=TRUE)