Sharla Gelfand

February 2019: discog-purrr

For February’s #mememe2019 post, I thought it’d be fun to look at my music collection. I painstakingly found the correct versions of all my records and tapes and entered them into discogs, the music database, for analysis.

There is a discogs API, and an R package for it, too! The discogger package is created by Ewen lastnameunknown and provides a way to query your discogs collection via R.

library(discogger)
library(dplyr)
library(conflicted)

conflict_prefer("filter", "dplyr")

I’m querying my collection (you can see it here), and only keeping the content bit.

my_collection <- discogs_user_collection(user_name = "sharlagelfand")[["content"]]

The result of the API call is a deeply nested list, starting with a list for each item in my collection (157 listed). Each of those lists then contains the data: instance_id, rating, basic_information, folder_id, date_added, and id.

We can look at this using str():

my_collection %>%
  head(2) %>%
  str(max.level = 2)
## List of 2
##  $ :List of 6
##   ..$ instance_id      : int 354823933
##   ..$ rating           : int 0
##   ..$ basic_information:List of 11
##   ..$ folder_id        : int 1
##   ..$ date_added       : chr "2019-02-16T17:48:59-08:00"
##   ..$ id               : int 7496378
##  $ :List of 6
##   ..$ instance_id      : int 354092601
##   ..$ rating           : int 0
##   ..$ basic_information:List of 11
##   ..$ folder_id        : int 1
##   ..$ date_added       : chr "2019-02-13T14:13:11-08:00"
##   ..$ id               : int 4490852

Or we can look at some diagrams that I created to help me understand lists.

In all of the diagrams in this post, squares represent lists, with varying degrees of white -> yellow -> orange -> red signifying the list depth, while the pill-shapes represent vectors. The pills are green, but if we run into any colour-blind issues: 1) my apologies, and 2) the shapes will always remain the same. They were all made using LucidChart.

Whether we look at the str() output or the diagram, we can see that basic_information is a list itself! This is just the beginning 😈

Once we look at basic_information, we can see that it also contains a number of lists that go quite deep. Given the data, this nesting is not that surprising; a release can be on multiple labels and come from multiple artists. It does make it fun to work with, though! 🙃

my_collection[[1]][["basic_information"]] %>%
  str()
## List of 11
##  $ labels      :List of 1
##   ..$ :List of 6
##   .. ..$ name            : chr "Tobi Records (2)"
##   .. ..$ entity_type     : chr "1"
##   .. ..$ catno           : chr "TOB-013"
##   .. ..$ resource_url    : chr "https://api.discogs.com/labels/633407"
##   .. ..$ id              : int 633407
##   .. ..$ entity_type_name: chr "Label"
##  $ year        : int 2015
##  $ master_url  : NULL
##  $ artists     :List of 1
##   ..$ :List of 7
##   .. ..$ join        : chr ""
##   .. ..$ name        : chr "Mollot"
##   .. ..$ anv         : chr ""
##   .. ..$ tracks      : chr ""
##   .. ..$ role        : chr ""
##   .. ..$ resource_url: chr "https://api.discogs.com/artists/4619796"
##   .. ..$ id          : int 4619796
##  $ id          : int 7496378
##  $ thumb       : chr "https://img.discogs.com/vEVegHrMNTsP6xG_K6OuFXz4h_U=/fit-in/150x150/filters:strip_icc():format(jpeg):mode_rgb()"| __truncated__
##  $ title       : chr "Demo"
##  $ formats     :List of 1
##   ..$ :List of 4
##   .. ..$ descriptions:List of 1
##   .. .. ..$ : chr "Numbered"
##   .. ..$ text        : chr "Black"
##   .. ..$ name        : chr "Cassette"
##   .. ..$ qty         : chr "1"
##  $ cover_image : chr "https://img.discogs.com/EmbMh7vsElksjRgoXLFSuY1sjRQ=/fit-in/500x499/filters:strip_icc():format(jpeg):mode_rgb()"| __truncated__
##  $ resource_url: chr "https://api.discogs.com/releases/7496378"
##  $ master_id   : int 0

For me, the first step is to extract the basic_information part only, and transpose it. purrr::transpose() “turns a list-of-lists ‘inside-out.’” I want to turn the list inside-out so that it starts to look more like a data frame, with a list of variables rather than a list of records (like, observations – not 💿).

library(purrr)

basic_information <- my_collection %>%
  map("basic_information") %>%
  transpose()

Once we do that, the list looks totally different. Now we have a list that contains the variables from basic_information, and each of those is a list with 157 elements (one per release) in it.

basic_information %>%
  str(max.level = 1)
## List of 11
##  $ labels      :List of 157
##  $ year        :List of 157
##  $ master_url  :List of 157
##  $ artists     :List of 157
##  $ id          :List of 157
##  $ thumb       :List of 157
##  $ title       :List of 157
##  $ formats     :List of 157
##  $ cover_image :List of 157
##  $ resource_url:List of 157
##  $ master_id   :List of 157
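
To see the inside-out flip on something small, here is a minimal sketch with a made-up two-record list (the records and their values are hypothetical, not from my collection). It only assumes purrr is installed.

```r
library(purrr)

# Hypothetical toy records, standing in for the real API response
records <- list(
  list(id = 1L, title = "Demo"),
  list(id = 2L, title = "I")
)

# Before: a list of records, each holding its own variables
str(records, max.level = 1)

# After: a list of variables, each holding one value per record
by_variable <- transpose(records)
names(by_variable)
length(by_variable$title)
```

The number of records becomes the length of each variable, which is the same thing that happened with the 157 releases above.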

So,

  1. Yes, this diagram is massive.
  2. No, the exact data is not always the same from diagram to diagram 💁
  3. An empty list means there is no data. Some variables have missing data, e.g. sometimes there isn’t text in formats. Sometimes master_url is missing, etc.

Now I have a wayyy better idea of what all that ugly str() output (omitted here) means. The deep nesting in descriptions is horrifying, but I’m confident in my ability to do it 💪.

The next step is to make that list into a tibble, and only keep the variables I care about. Once we do that, it’s a little less hectic to look at.

basic_information_tibble <- basic_information %>%
  as_tibble() %>%
  select(id, title, artists, formats)

head(basic_information_tibble)
## # A tibble: 6 x 4
##   id        title     artists    formats   
##   <list>    <list>    <list>     <list>    
## 1 <int [1]> <chr [1]> <list [1]> <list [1]>
## 2 <int [1]> <chr [1]> <list [1]> <list [1]>
## 3 <int [1]> <chr [1]> <list [1]> <list [1]>
## 4 <int [1]> <chr [1]> <list [1]> <list [1]>
## 5 <int [1]> <chr [1]> <list [1]> <list [1]>
## 6 <int [1]> <chr [1]> <list [1]> <list [1]>

Did somebody say list-cols? 😋

The easiest things to tackle next are the id and title columns. Every release has only one of each, so we can unlist these and turn the columns into… non-list-cols? reg-cols.

basic_information_id_title_unlist <- basic_information_tibble %>% 
  mutate_at(vars(id, title), unlist)

head(basic_information_id_title_unlist)
## # A tibble: 6 x 4
##        id title                             artists    formats   
##     <int> <chr>                             <list>     <list>    
## 1 7496378 Demo                              <list [1]> <list [1]>
## 2 4490852 Observant Com El Mon Es Destrueix <list [1]> <list [1]>
## 3 5556486 Fuck Off                          <list [1]> <list [1]>
## 4 9827276 I                                 <list [1]> <list [1]>
## 5 9769203 Oído Absoluto                     <list [1]> <list [1]>
## 6 7237138 A Cat's Cause, No Dogs Problem    <list [1]> <list [1]>

(“sorry” about the third item – it’s a great record!)

I’ve coloured them in blue now, to indicate they’re no longer lists (not differentiating between integer and character, though).
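
Here is a small sketch of that list-col to reg-col step on a made-up tibble (the values are hypothetical). It assumes every element holds exactly one value; if any element had zero or several, unlist() would return a vector of the wrong length and mutate() would error.

```r
library(tibble)
library(dplyr)

# Hypothetical two-row tibble whose columns are lists of length-one values
toy <- tibble(
  id = list(1L, 2L),
  title = list("Demo", "I")
)

# unlist() flattens each list-column into an ordinary atomic vector
toy_flat <- toy %>%
  mutate(id = unlist(id), title = unlist(title))

class(toy$id)
class(toy_flat$id)
```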

To make artists workable, I’m transposing it, just like we did with the original list.

basic_information_artists_transpose <- basic_information_id_title_unlist %>%
  mutate(artists = map(artists, transpose))

head(basic_information_artists_transpose)
## # A tibble: 6 x 4
##        id title                             artists    formats   
##     <int> <chr>                             <list>     <list>    
## 1 7496378 Demo                              <list [7]> <list [1]>
## 2 4490852 Observant Com El Mon Es Destrueix <list [7]> <list [1]>
## 3 5556486 Fuck Off                          <list [7]> <list [1]>
## 4 9827276 I                                 <list [7]> <list [1]>
## 5 9769203 Oído Absoluto                     <list [7]> <list [1]>
## 6 7237138 A Cat's Cause, No Dogs Problem    <list [7]> <list [1]>

It looks a little different in the tibble printing, but it’s tough to tell what’s going on. Instead of a list for each artist that contains name, id, etc, now each of those is a list that contains that piece of information for every artist.

So, if there are two artists on a release, their names will both appear under name, rather than having a list for each, and each with a name element.
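
A minimal sketch of the two-artist case, using made-up artist names and ids (nothing here comes from the real API response):

```r
library(purrr)

# Hypothetical release with two artists (names and ids are made up)
artists <- list(
  list(name = "Artist A", id = 101L),
  list(name = "Artist B", id = 102L)
)

# Before: one list per artist
artists[[2]][["name"]]

# After: one list per field, holding a value for each artist
artists_t <- transpose(artists)
artists_t[["name"]]
```

After transposing, both names sit side by side under name, and both ids under id.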

The last column to work with is formats, arguably the ugliest one! I actually don’t want to transpose it, because descriptions is already in the format we want. If we transpose it, then it’ll turn that inside out, which is… not good. Believe me, I spent a while playing with it before I realized what was happening 🙈.

We want to unlist formats, just like we did for id and title. The only difference is that we just want to remove the first level of the list hierarchy, so we’ll set recursive = FALSE.

The reason that we can do this to formats is because it’s unnecessarily nested. Unlike artists, there are no cases where there’s more than one format attached to a release, so the nested list isn’t necessary.

basic_information_formats_unlist <- basic_information_artists_transpose %>%
  mutate(formats = unlist(formats, recursive = FALSE))

head(basic_information_formats_unlist)
## # A tibble: 6 x 4
##        id title                             artists    formats   
##     <int> <chr>                             <list>     <list>    
## 1 7496378 Demo                              <list [7]> <list [4]>
## 2 4490852 Observant Com El Mon Es Destrueix <list [7]> <list [3]>
## 3 5556486 Fuck Off                          <list [7]> <list [3]>
## 4 9827276 I                                 <list [7]> <list [3]>
## 5 9769203 Oído Absoluto                     <list [7]> <list [3]>
## 6 7237138 A Cat's Cause, No Dogs Problem    <list [7]> <list [3]>

But just like for artists, now the actual elements are at the top level of the list, rather than being buried.
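
To show what recursive = FALSE is protecting, here is a sketch on a made-up formats column for two releases (the format names and descriptions are hypothetical). It assumes, as above, that each release wraps exactly one format.

```r
# Hypothetical formats column for two releases: each entry is a list
# wrapping exactly one format
formats <- list(
  list(list(name = "Cassette", descriptions = list("Numbered"))),
  list(list(name = "Vinyl", descriptions = list("LP", "Album")))
)

# recursive = FALSE strips only the outer wrapper, one entry per release
one_level <- unlist(formats, recursive = FALSE)
length(one_level)
one_level[[2]][["name"]]

# The default flattens everything into a single character vector,
# losing track of which values belong to which release
all_levels <- unlist(formats)
length(all_levels)
```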

You thought we were done?! Ha! We still need to get information out of artists and formats.

I want to get the id from artists so that (eventually), I can use the API again to get even more information about the artist.

basic_information_tidying <- basic_information_formats_unlist %>%
  mutate(artists_id = map(artists, "id")) %>%
  select(-artists)

basic_information_tidying %>%
  head()
## # A tibble: 6 x 4
##        id title                             formats    artists_id
##     <int> <chr>                             <list>     <list>    
## 1 7496378 Demo                              <list [4]> <list [1]>
## 2 4490852 Observant Com El Mon Es Destrueix <list [3]> <list [1]>
## 3 5556486 Fuck Off                          <list [3]> <list [1]>
## 4 9827276 I                                 <list [3]> <list [1]>
## 5 9769203 Oído Absoluto                     <list [3]> <list [1]>
## 6 7237138 A Cat's Cause, No Dogs Problem    <list [3]> <list [1]>

But this isn’t quite it. artists_id is still a list of lists.

basic_information_tidying <- basic_information_tidying %>%
  mutate(artists_id = map(artists_id, unlist))

head(basic_information_tidying)
## # A tibble: 6 x 4
##        id title                             formats    artists_id
##     <int> <chr>                             <list>     <list>    
## 1 7496378 Demo                              <list [4]> <int [1]> 
## 2 4490852 Observant Com El Mon Es Destrueix <list [3]> <int [1]> 
## 3 5556486 Fuck Off                          <list [3]> <int [1]> 
## 4 9827276 I                                 <list [3]> <int [1]> 
## 5 9769203 Oído Absoluto                     <list [3]> <int [1]> 
## 6 7237138 A Cat's Cause, No Dogs Problem    <list [3]> <int [1]>

Now it’s a list of integer vectors. This is, I think, the kind of “list-col” that I’m more familiar with. This is the lego-game-of-thrones-horse-on-a-balcony kind of list-col.
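
The two steps (pull out id, then unlist each entry) can be sketched on a made-up transposed artists column, where the second release has two artists. All names and ids are hypothetical.

```r
library(purrr)

# Hypothetical transposed artists column: one entry per release
artists <- list(
  list(name = list("Artist A"), id = list(101L)),
  list(name = list("Artist B", "Artist C"), id = list(102L, 103L))
)

# Passing a string to map() extracts the element with that name
ids_nested <- map(artists, "id")   # still a list of lists

# unlist() on each entry collapses it to an integer vector
ids <- map(ids_nested, unlist)
ids[[2]]
```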

I’ve tried to illustrate the difference here. The little circles mean it’s a character vector with two elements.

I also want the name and descriptions bits from formats.

name is really easy because it’s just a character vector, and never contains multiple elements.

basic_information_tidying <- basic_information_tidying %>%
  mutate(format_name = map_chr(formats, "name"))

head(basic_information_tidying)
## # A tibble: 6 x 5
##        id title                            formats   artists_id format_name
##     <int> <chr>                            <list>    <list>     <chr>      
## 1 7496378 Demo                             <list [4… <int [1]>  Cassette   
## 2 4490852 Observant Com El Mon Es Destrue… <list [3… <int [1]>  Vinyl      
## 3 5556486 Fuck Off                         <list [3… <int [1]>  Vinyl      
## 4 9827276 I                                <list [3… <int [1]>  Vinyl      
## 5 9769203 Oído Absoluto                    <list [3… <int [1]>  Vinyl      
## 6 7237138 A Cat's Cause, No Dogs Problem   <list [3… <int [1]>  Vinyl
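
For comparison, a tiny sketch of why map_chr() works for name: on made-up formats (hypothetical values), map() gives back a list, while map_chr() gives a plain character vector, and only works because each name is a single string.

```r
library(purrr)

# Hypothetical formats, already unwrapped to one list per release
formats <- list(
  list(name = "Cassette", qty = "1"),
  list(name = "Vinyl", qty = "1")
)

names_list <- map(formats, "name")      # a list of length-one strings
names_chr <- map_chr(formats, "name")   # a character vector

class(names_list)
class(names_chr)
```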

Finally, descriptions, for all the headache it’s been, is the exact same to extract as artists_id. I’ve set .default = NA in case any are missing. NAs are a lot easier to work with than NULLs later on (endless thank you to Jenny Bryan, not only for absolutely everything I know about lists and purrr, but for telling me about this argument).

basic_information_tidying <- basic_information_tidying %>%
  mutate(format_description = map(formats, "descriptions", .default = NA),
         format_description = map(format_description, unlist))
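
A minimal sketch of what .default changes, using two made-up formats where the second has no descriptions element (the values are hypothetical):

```r
library(purrr)

# Hypothetical formats: the second one is missing descriptions
formats <- list(
  list(name = "Cassette", descriptions = list("Numbered")),
  list(name = "Vinyl")
)

# Without .default, a missing element comes back as NULL
without_default <- map(formats, "descriptions")

# With .default = NA, the gap is filled with NA instead
with_default <- map(formats, "descriptions", .default = NA)

# Same unlist step as above: a character vector where data exists, NA where not
descriptions <- map(with_default, unlist)
descriptions
```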

With a little renaming and reordering, our final dataset looks like this:

basic_information_tidy <- basic_information_tidying %>%
    select(release_id = id, title, artists_id, format_description, format_name)

head(basic_information_tidy)
## # A tibble: 6 x 5
##   release_id title                 artists_id format_descripti… format_name
##        <int> <chr>                 <list>     <list>            <chr>      
## 1    7496378 Demo                  <int [1]>  <chr [1]>         Cassette   
## 2    4490852 Observant Com El Mon… <int [1]>  <chr [1]>         Vinyl      
## 3    5556486 Fuck Off              <int [1]>  <chr [1]>         Vinyl      
## 4    9827276 I                     <int [1]>  <chr [3]>         Vinyl      
## 5    9769203 Oído Absoluto         <int [1]>  <chr [2]>         Vinyl      
## 6    7237138 A Cat's Cause, No Do… <int [1]>  <chr [2]>         Vinyl

What are we going to do with it? That’s a topic for another post 👋.