unpivotr deals with non-tabular data, especially from spreadsheets. Use unpivotr when your source data has any of these ‘features’:
If that list makes your blood boil, you’ll enjoy the function names.
behead()
deals with multi-headered hydra tables one
layer of headers at a time, working from the edge of the table inwards.
It’s a bit like using header = TRUE
in
read.csv()
, but because it’s a function, you can apply it
to as many layers of headers as you need. You end up with all the
headers in columns.spatter()
is like tidyr::spread()
but
preserves mixed data types. You get into a mixed-data-type situation by
delaying type coercion until after the table is tidy (rather
than before, like read.csv()
et al). And yes, it usually
follows behead()
.More positive, corrective functions:
justify()
aligns column headers before
behead()
ing, and has deliberate moral overtones.enhead()
attaches a header to the body of the data,
a la Frankenstein. The effect is the same as
behead()
, but is more powerful because you can choose
exactly which header cells you want, paying attention to formatting
(which behead()
doesn’t understand).isolate_sentinels()
separates meaningful symbols like
"N/A"
or "confidential"
from the rest of the
data, giving them some time alone think about what they’ve done.partition()
takes a sheet with several tables on it,
and slashes into pieces that each contain one table. You can then
unpivot each table in turn with purrr::map()
or
similar.Unpivotr uses data where each cells is represented by one row in a dataframe. Like this.
What can you do with tidy cells? The best places to start are:
Otherwise the basic idea is:
devtools::install_github("tidyverse/readr#760")
.unpivotr::tidy_html()
unpivotr::as_cells()
– this should
be a last resort, because by the time the data is in a conventional data
frame, it is often too late – formatting has been lost, and most data
types have been coerced to strings.behead()
straight away, else
dplyr::filter()
separately for the header cells and the
data cells, and then recombine with enhead()
.spatter()
so that each column has one data type.library(unpivotr)
library(tidyverse)
#> ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
#> ✔ dplyr 1.1.4 ✔ readr 2.1.5
#> ✔ forcats 1.0.0 ✔ stringr 1.5.1
#> ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
#> ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
#> ✔ purrr 1.0.2
#> ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
#> ✖ dplyr::filter() masks stats::filter()
#> ✖ dplyr::lag() masks stats::lag()
#> ✖ tidyr::pack() masks unpivotr::pack()
#> ✖ tidyr::unpack() masks unpivotr::unpack()
#> ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
<- purpose$`up-left left-up`
x # A pivot table in a conventional data frame. Four levels of headers, in two
x #> X2 X3 X4 X5 X6 X7
#> 1 <NA> <NA> Female <NA> Male <NA>
#> 2 <NA> <NA> 0 - 6 7 - 10 0 - 6 7 - 10
#> 3 Bachelor's degree 15 - 24 7000 27000 <NA> 13000
#> 4 <NA> 25 - 44 12000 137000 9000 81000
#> 5 <NA> 45 - 64 10000 64000 7000 66000
#> 6 <NA> 65+ <NA> 18000 7000 17000
#> 7 Certificate 15 - 24 29000 161000 30000 190000
#> 8 <NA> 25 - 44 34000 179000 31000 219000
#> 9 <NA> 45 - 64 30000 210000 23000 199000
#> 10 <NA> 65+ 12000 77000 8000 107000
#> 11 Diploma 15 - 24 <NA> 14000 9000 11000
#> 12 <NA> 25 - 44 10000 66000 8000 47000
#> 13 <NA> 45 - 64 6000 68000 5000 58000
#> 14 <NA> 65+ 5000 41000 1000 34000
#> 15 No Qualification 15 - 24 10000 43000 12000 37000
#> 16 <NA> 25 - 44 11000 36000 21000 50000
#> 17 <NA> 45 - 64 19000 91000 17000 75000
#> 18 <NA> 65+ 16000 118000 9000 66000
#> 19 Postgraduate qualification 15 - 24 <NA> 6000 <NA> <NA>
#> 20 <NA> 25 - 44 5000 86000 7000 60000
#> 21 <NA> 45 - 64 6000 55000 6000 68000
#> 22 <NA> 65+ <NA> 13000 <NA> 18000
# rows and two columns.
<- as_cells(x) # 'Tokenize' or 'melt' the data frame into one row per cell
y
y#> # A tibble: 132 × 4
#> row col data_type chr
#> <int> <int> <chr> <chr>
#> 1 1 1 chr <NA>
#> 2 2 1 chr <NA>
#> 3 3 1 chr Bachelor's degree
#> 4 4 1 chr <NA>
#> 5 5 1 chr <NA>
#> 6 6 1 chr <NA>
#> 7 7 1 chr Certificate
#> 8 8 1 chr <NA>
#> 9 9 1 chr <NA>
#> 10 10 1 chr <NA>
#> # ℹ 122 more rows
rectify(y) # useful for reviewing the melted form as though in a spreadsheet
#> # A tibble: 22 × 7
#> `row/col` `1(A)` `2(B)` `3(C)` `4(D)` `5(E)` `6(F)`
#> <int> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 1 <NA> <NA> Female <NA> Male <NA>
#> 2 2 <NA> <NA> 0 - 6 7 - 10 0 - 6 7 - 10
#> 3 3 Bachelor's degree 15 - 24 7000 27000 <NA> 13000
#> 4 4 <NA> 25 - 44 12000 137000 9000 81000
#> 5 5 <NA> 45 - 64 10000 64000 7000 66000
#> 6 6 <NA> 65+ <NA> 18000 7000 17000
#> 7 7 Certificate 15 - 24 29000 161000 30000 190000
#> 8 8 <NA> 25 - 44 34000 179000 31000 219000
#> 9 9 <NA> 45 - 64 30000 210000 23000 199000
#> 10 10 <NA> 65+ 12000 77000 8000 107000
#> # ℹ 12 more rows
%>%
y behead("up-left", "sex") %>% # Strip headers
behead("up", "life-satisfication") %>% # one
behead("left-up", "qualification") %>% # by
behead("left", "age-band") %>% # one.
select(-row, -col, -data_type, count = chr) %>% # cleanup
mutate(count = as.integer(count))
#> # A tibble: 80 × 5
#> count sex `life-satisfication` qualification `age-band`
#> <int> <chr> <chr> <chr> <chr>
#> 1 7000 Female 0 - 6 Bachelor's degree 15 - 24
#> 2 12000 Female 0 - 6 Bachelor's degree 25 - 44
#> 3 10000 Female 0 - 6 Bachelor's degree 45 - 64
#> 4 NA Female 0 - 6 Bachelor's degree 65+
#> 5 27000 Female 7 - 10 Bachelor's degree 15 - 24
#> 6 137000 Female 7 - 10 Bachelor's degree 25 - 44
#> 7 64000 Female 7 - 10 Bachelor's degree 45 - 64
#> 8 18000 Female 7 - 10 Bachelor's degree 65+
#> 9 NA Male 0 - 6 Bachelor's degree 15 - 24
#> 10 9000 Male 0 - 6 Bachelor's degree 25 - 44
#> # ℹ 70 more rows
Note the compass directions in the code above, which hint to
behead()
where to find the header cell for each data
cell.
"up-left"
means the header (Female
,
Male
) is positioned up and to the left of the columns of
data cells it describes."up"
means the header (0 - 6
,
7 - 10
) is positioned directly above the columns of data
cells it describes."left-up"
means the header
(Bachelor's degree
, Certificate
, etc.) is
positioned to the left and upwards of the rows of data cells it
describes."left"
means the header (15 - 24
,
25 - 44
, etc.) is positioned directly to the left of the
rows of data cells it describes.# install.packages("devtools") # If you don't already have devtools
::install_github("nacnudus/unpivotr", build_vignettes = TRUE) devtools
The version 0.4.0 release had somee breaking changes. See
NEWS.md
for details. The previous version can be installed
as follow:
::install_version("unpivotr", version = "0.3.1", repos = "http://cran.us.r-project.org") devtools
unpivotr is inspired by Databaker, a collaboration between the United Kingdom Office of National Statistics and The Sensible Code Company. unpivotr.
jailbreaker attempts to extract non-tabular data from spreadsheets into tabular structures automatically via some clever algorithms. unpivotr differs by being less magic, and equipping you to express what you want to do.