Sharla Gelfand

February 2019: More discogging

(Yes, I know it’s March. This is a delayed followup for the February edition of my #mememe2019 project. It involves even more (and sometimes painful) data cleaning. If you just want to see some actual visualizations, head over to part three.)

In one of my last posts, I went through painstaking lengths to understand and tidy the nested lists that the discogs API returns. Now, it’s time to actually do something with the data!

I’m interested in a few things about my music collection, specifically:

  • the style of music,
  • the year it came out,
  • where it came from.

The data I have so far, unfortunately, doesn’t contain any of that information! So we need to go back to discogger for more.

Here’s what I have so far:

library(dplyr)
library(here)

basic_information_tidy <- readRDS(here("static", "data", "discogger", "basic_information_tidy.rds"))

basic_information_tidy
## # A tibble: 157 x 5
##    release_id title                artists_id format_descripti… format_name
##         <int> <chr>                <list>     <list>            <chr>      
##  1    7496378 Demo                 <int [1]>  <chr [1]>         Cassette   
##  2    4490852 Observant Com El Mo… <int [1]>  <chr [1]>         Vinyl      
##  3    5556486 Fuck Off             <int [1]>  <chr [1]>         Vinyl      
##  4    9827276 I                    <int [1]>  <chr [3]>         Vinyl      
##  5    9769203 Oído Absoluto        <int [1]>  <chr [2]>         Vinyl      
##  6    7237138 A Cat's Cause, No D… <int [1]>  <chr [2]>         Vinyl      
##  7   13117042 Tashme               <int [1]>  <chr [2]>         Vinyl      
##  8    7113575 Demo                 <int [1]>  <chr [1]>         Cassette   
##  9   10540713 Let The Miracles Be… <int [1]>  <chr [1]>         Cassette   
## 10   11260950 Sub Space            <int [1]>  <lgl [1]>         Cassette   
## # … with 147 more rows

First, I’ll use release_id to go get more information about each release, using the discogger::discogs_release() function.

According to the discogs API documentation, “Requests are throttled by the server by source IP to 60 per minute for authenticated requests, and 25 per minute for unauthenticated requests, with some exceptions.” I’ll need to write my code in a way that handles this rate limit, instead of just failing!

A good way to do that is using the insistently() function in the purrr package. insistently modifies a function to make it retry a certain number of times if there’s an error.

It does this by using a rate object that determines the number of retries and the time to wait between each. I’m going to use the rate_delay() helper function. You just tell rate_delay() how long to wait (the pause argument) and how many times to try (max_times).

Since the API limits you by requests per minute, I’ll set pause to 60 (seconds) and max_times to 2 – if we can’t get it after two tries, then it’s probably an actual error, and not rate limiting.

library(purrr)
library(discogger)

insistently_discogs_release <- insistently(discogs_release,
                                           rate = rate_delay(pause = 60, max_times = 2))

Now, if there is still an error after dealing with the rate limiting, I don’t want the whole thing to fail! So in addition to insistently(), I’m making the function work safely by using safely(), which modifies the function to return both a result and an error every time. If there’s no error, then result is not NULL and error is NULL. If there is an error, then the opposite is true. And actually, I’m going to modify safely() a bit so it returns NA (a missing value indicator) instead of NULL (a literal null object).

safely_insistently_discogs_release <- safely(insistently_discogs_release,
                                             otherwise = NA)

Now that that’s done, I can actually use this safe, insistent function to get the release information for every release in my collection!

release_data <- basic_information_tidy %>%
  select(release_id) %>%
  mutate(data = map(release_id, safely_insistently_discogs_release))

Looking at at what my function returned, I can see that data is a list that contains result and error.

release_data %>%
  slice(1) %>%
  str(max.level = 3)
## Classes 'tbl_df', 'tbl' and 'data.frame':    1 obs. of  2 variables:
##  $ release_id: int 7496378
##  $ data      :List of 1
##   ..$ :List of 2
##   .. ..$ result:List of 3
##   .. .. ..- attr(*, "class")= chr "discogs_database"
##   .. ..$ error : NULL

Let’s pull them out! And, like before, the actual release information is in an element of the result list called content (the rest just contains information about the API call).

Note that I’ve set the .default argument in the map() functions to NA – otherwise, they return a NULL, which is way harder to work with! Would it be a blog post using purrr if I didn’t thank Jenny Bryan for telling me how to do something? 🙏

release_data <- release_data %>%
  mutate(result = map(data, "result", .default = NA),
         error = map(data, "error", .default = NA),
         release_content = map(result, "content", .default = NA))

release_data
## # A tibble: 157 x 5
##    release_id data       result                 error     release_content
##         <int> <list>     <list>                 <list>    <list>         
##  1    7496378 <list [2]> <S3: discogs_database> <lgl [1]> <list [31]>    
##  2    4490852 <list [2]> <S3: discogs_database> <lgl [1]> <list [34]>    
##  3    5556486 <list [2]> <S3: discogs_database> <lgl [1]> <list [32]>    
##  4    9827276 <list [2]> <S3: discogs_database> <lgl [1]> <list [34]>    
##  5    9769203 <list [2]> <S3: discogs_database> <lgl [1]> <list [34]>    
##  6    7237138 <list [2]> <S3: discogs_database> <lgl [1]> <list [34]>    
##  7   13117042 <list [2]> <S3: discogs_database> <lgl [1]> <list [33]>    
##  8    7113575 <list [2]> <S3: discogs_database> <lgl [1]> <list [34]>    
##  9   10540713 <list [2]> <S3: discogs_database> <lgl [1]> <list [33]>    
## 10   11260950 <list [2]> <S3: discogs_database> <lgl [1]> <list [32]>    
## # … with 147 more rows

I’m not going to go as wild with the nested lists today, even though they’re definitely there, because 1) I think we’ve all learned our lesson, and 2) I really don’t need that much information.

All I want is information on the music style and the year it was released, which are at the top level:

release_data %>%
  slice(1) %>%
  pull(release_content) %>%
  str(max.level = 2)
## List of 1
##  $ :List of 31
##   ..$ status            : chr "Accepted"
##   ..$ series            : list()
##   ..$ labels            :List of 1
##   ..$ community         :List of 7
##   ..$ year              : int 2015
##   ..$ images            :List of 1
##   ..$ format_quantity   : int 1
##   ..$ id                : int 7496378
##   ..$ artists_sort      : chr "Mollot"
##   ..$ genres            :List of 1
##   ..$ thumb             : chr "https://img.discogs.com/vEVegHrMNTsP6xG_K6OuFXz4h_U=/fit-in/150x150/filters:strip_icc():format(jpeg):mode_rgb()"| __truncated__
##   ..$ num_for_sale      : int 0
##   ..$ title             : chr "Demo"
##   ..$ artists           :List of 1
##   ..$ date_changed      : chr "2015-09-19T12:56:14-07:00"
##   ..$ lowest_price      : NULL
##   ..$ styles            :List of 2
##   ..$ released_formatted: chr "18 Sep 2015"
##   ..$ formats           :List of 1
##   ..$ estimated_weight  : int 65
##   ..$ released          : chr "2015-09-18"
##   ..$ date_added        : chr "2015-09-19T08:54:15-07:00"
##   ..$ extraartists      : list()
##   ..$ tracklist         :List of 5
##   ..$ notes             : chr "Spray-painted black tapes, 31 hand numered copies, risograph printed artwork. "
##   ..$ identifiers       : list()
##   ..$ companies         : list()
##   ..$ uri               : chr "https://www.discogs.com/Mollot-Demo/release/7496378"
##   ..$ country           : chr "Chile"
##   ..$ resource_url      : chr "https://api.discogs.com/releases/7496378"
##   ..$ data_quality      : chr "Needs Vote"

We can grab those pretty easily, phew! And chuck everything else out. Again, I’m setting the default to be NA because those values might not be there for every release. If they’re not, I don’t want to deal with NULLs!

release_data <- release_data %>%
  mutate(year = map_dbl(release_content, "year", .default = NA_integer_),
         styles = map(release_content, "styles", .default = NA)) %>%
  select(-data, -result, -error, -release_content)

Like some things were in my last blog post, styles is a list of lists:

release_data
## # A tibble: 157 x 3
##    release_id  year styles    
##         <int> <dbl> <list>    
##  1    7496378  2015 <list [2]>
##  2    4490852  2013 <list [2]>
##  3    5556486  2014 <list [2]>
##  4    9827276  2017 <list [2]>
##  5    9769203  2017 <list [1]>
##  6    7237138  2015 <list [2]>
##  7   13117042  2019 <list [2]>
##  8    7113575  2014 <list [2]>
##  9   10540713  2015 <list [2]>
## 10   11260950  2017 <list [2]>
## # … with 147 more rows
release_data %>%
  slice(1) %>%
  pull(styles) %>%
  str()
## List of 1
##  $ :List of 2
##   ..$ : chr "Punk"
##   ..$ : chr "Hardcore"

when really I just want a list of character vectors.

release_data <- release_data %>%
  mutate(styles = map(styles, ~ unlist(.), .default = NA))

release_data %>%
  slice(1) %>%
  pull(styles) %>%
  str()
## List of 1
##  $ : chr [1:2] "Punk" "Hardcore"

Awesome!

Except, for my purposes I actually only want one style per release. I know. So, I looked through them and came up with a prioritized list of styles - for example, if a release has both “hardcore” and “punk,” I’d take “hardcore” as the style.

style_priority <- function(style){
  case_when(style %in% c("New Wave", "Hardcore") ~ 1,
            style %in% c("Indie Rock", "Shoegaze", "Stoner Rock", "Black Metal", "Indie Pop") ~ 2,
            style %in% c("Experimental", "Post-Punk", "Grunge", "Synth-pop") ~ 3,
            style %in% c("Punk") ~ 4,
            TRUE ~ 5)
}

library(tidyr)

release_data <- release_data %>%
  unnest() %>%
  mutate(style_priority = style_priority(styles)) %>%
  group_by(release_id) %>%
  filter(style_priority == min(style_priority)) %>%
  ungroup() %>%
  select(-style_priority) %>%
  rename(style = styles)

Some releases are missing a style, so I’ll manually put what I think it should be.

release_data <- release_data %>%
  mutate(style = case_when(release_id %in% c(9477305) ~ "Post-Punk",
                           release_id %in% c(6022854, 4465008) ~ "Indie Rock",
                           release_id %in% c(7018322) ~ "Indie Pop",
                           release_id %in% c(2819870, 4298516) ~ "Hip Hop",
                           TRUE ~ style))

release_data
## # A tibble: 157 x 3
##    release_id  year style    
##         <int> <dbl> <chr>    
##  1    7496378  2015 Hardcore 
##  2    4490852  2013 Hardcore 
##  3    5556486  2014 Post-Punk
##  4    9827276  2017 Hardcore 
##  5    9769203  2017 Punk     
##  6    7237138  2015 Hardcore 
##  7   13117042  2019 Hardcore 
##  8    7113575  2014 Hardcore 
##  9   10540713  2015 Post-Punk
## 10   11260950  2017 Hardcore 
## # … with 147 more rows

Perfect!

Next, I want to know where each release came from (i.e., where the artists are from), so I have to use the API to look up information by artist.

Again, I’m creating a safe and insistent version of the appropriate discogger function to do so!

I’m also excluding the id 194 because it’s just a placeholder ID used any time an artist on a release is “Various”, and doesn’t have an artists page – no point in hitting the API for results I know aren’t there!

library(tidyr)
safely_insistently_discogs_artist <- safely(insistently(discogs_artist,
                                                        rate = rate_delay(pause = 60,
                                                                          max_times = 2)),
                                            otherwise = NA)

artist_data <- basic_information_tidy %>%
  select(artists_id) %>%
  unnest() %>%
  distinct() %>%
  filter(artists_id != 194) %>%
  mutate(data = map(artists_id, safely_insistently_discogs_artist),
         result = map(data, "result", .default = NA),
         error = map(data, "error", .default = NA),
         artist_content = map(result, "content", .default = NA))

Errors?

artist_data %>%
  filter(!is.na(error))
## # A tibble: 0 x 5
## # … with 5 variables: artists_id <int>, data <list>, result <list>,
## #   error <list>, artist_content <list>

Nice! If I didn’t modify the function to run insistently and safely, there would have been lots of errors. Believe me, I tried. 😅

All I want from the artist data are the profile and the name, which are pretty easy to grab.

artist_data <- artist_data %>%
  mutate(profile = map_chr(artist_content, "profile", .default = NA_character_),
         artist_name = map_chr(artist_content, "name", .default = NA_character_)) %>%
  select(artist_id = artists_id, profile, artist_name)

artist_data
## # A tibble: 120 x 3
##    artist_id profile                                     artist_name       
##        <int> <chr>                                       <chr>             
##  1   4619796 Canadian hardcore-punk band from Toronto, … Mollot            
##  2   3192745 Hardcore / punk band from Barcelona, Spain. Una Bèstia Incont…
##  3   2876549 Punk band from London, UK.                  Good Throb        
##  4   2769828 S.H.I.T. is a hardcore / punk band from To… S.H.I.T. (3)      
##  5   4282571 Punk band from Madrid (Spain).              Rata Negra        
##  6   3596548 American hardcore / punk band from New Yor… Ivy (18)          
##  7   5211980 Canadian hardcore / punk band from Toronto… Tashme            
##  8   4450861 Canadian hardcore / punk band from Calgary… Desgraciados      
##  9   4273896 Punk band from Kansas City.                 Phantom Head      
## 10   5694086 American hardcore-punk band from New York … Sub Space (2)     
## # … with 110 more rows

Now I want to do two things. The first is actually get the artist name – when there’s a duplicate artist name on discogs, they put a number in parentheses at the end of the name. I don’t want that!

artist_data <- artist_data %>%
  separate(artist_name, into = "artist_name", sep = " \\([0-9]")

artist_data
## # A tibble: 120 x 3
##    artist_id profile                                     artist_name       
##        <int> <chr>                                       <chr>             
##  1   4619796 Canadian hardcore-punk band from Toronto, … Mollot            
##  2   3192745 Hardcore / punk band from Barcelona, Spain. Una Bèstia Incont…
##  3   2876549 Punk band from London, UK.                  Good Throb        
##  4   2769828 S.H.I.T. is a hardcore / punk band from To… S.H.I.T.          
##  5   4282571 Punk band from Madrid (Spain).              Rata Negra        
##  6   3596548 American hardcore / punk band from New Yor… Ivy               
##  7   5211980 Canadian hardcore / punk band from Toronto… Tashme            
##  8   4450861 Canadian hardcore / punk band from Calgary… Desgraciados      
##  9   4273896 Punk band from Kansas City.                 Phantom Head      
## 10   5694086 American hardcore-punk band from New York … Sub Space         
## # … with 110 more rows

Good!

The next is a little more… complicated.

The band profile tends to take the format “… band from [location]”, with information about the location from “from” to the next period or to the end of the string. To extract this, I’m using a regex that I don’t totally understand, but that my coworker Sharleen has blogged about (and told me about when I bet she couldn’t do it even though she had already told me she did it 😬).

library(stringr)

artist_data <- artist_data %>%
  mutate(location = coalesce(str_extract(profile, "(?<=from\\s).+?(?=\\.)"),
                             str_extract(profile, "(?<=from).*"))) %>%
  separate(location, into = c("city", "province", "country"), sep = "[:punct:]") %>%
  mutate_at(vars(city, province, country), str_trim)

artist_data
## # A tibble: 120 x 6
##    artist_id profile                 artist_name    city   province country
##        <int> <chr>                   <chr>          <chr>  <chr>    <chr>  
##  1   4619796 Canadian hardcore-punk… Mollot         Toron… Ontario  <NA>   
##  2   3192745 Hardcore / punk band f… Una Bèstia In… Barce… Spain    <NA>   
##  3   2876549 Punk band from London,… Good Throb     London UK       <NA>   
##  4   2769828 S.H.I.T. is a hardcore… S.H.I.T.       Toron… Ontario  Canada 
##  5   4282571 Punk band from Madrid … Rata Negra     Madrid Spain    ""     
##  6   3596548 American hardcore / pu… Ivy            New Y… <NA>     <NA>   
##  7   5211980 Canadian hardcore / pu… Tashme         Toron… Ontario  <NA>   
##  8   4450861 Canadian hardcore / pu… Desgraciados   Calga… Alberta  <NA>   
##  9   4273896 Punk band from Kansas … Phantom Head   Kansa… <NA>     <NA>   
## 10   5694086 American hardcore-punk… Sub Space      New Y… <NA>     <NA>   
## # … with 110 more rows

It looks ok, but the data isn’t super consistent in terms of whether it includes a province, country, both, neither, etc. To try and figure out what belongs where, if there isn’t an entry in country then I’m going to look at province and compare it to a list of countries from the maps package. If it’s in there, then I’ll set country to that. Capeesh?

library(maps)

countries <- world.cities %>%
  pull(country.etc) %>%
  unique()

artist_data <- artist_data %>%
  mutate(province_is_country = (is.na(country) | country == "") & province %in% countries,
         country = ifelse(province_is_country, province, country),
         province = ifelse(province_is_country, NA_character_, province)) %>%
  select(-province_is_country)

artist_data
## # A tibble: 120 x 6
##    artist_id profile                 artist_name    city   province country
##        <int> <chr>                   <chr>          <chr>  <chr>    <chr>  
##  1   4619796 Canadian hardcore-punk… Mollot         Toron… Ontario  <NA>   
##  2   3192745 Hardcore / punk band f… Una Bèstia In… Barce… <NA>     Spain  
##  3   2876549 Punk band from London,… Good Throb     London <NA>     UK     
##  4   2769828 S.H.I.T. is a hardcore… S.H.I.T.       Toron… Ontario  Canada 
##  5   4282571 Punk band from Madrid … Rata Negra     Madrid <NA>     Spain  
##  6   3596548 American hardcore / pu… Ivy            New Y… <NA>     <NA>   
##  7   5211980 Canadian hardcore / pu… Tashme         Toron… Ontario  <NA>   
##  8   4450861 Canadian hardcore / pu… Desgraciados   Calga… Alberta  <NA>   
##  9   4273896 Punk band from Kansas … Phantom Head   Kansa… <NA>     <NA>   
## 10   5694086 American hardcore-punk… Sub Space      New Y… <NA>     <NA>   
## # … with 110 more rows

That’s starting to look better! But some artists don’t even have a city:

artist_data %>%
  filter(is.na(city))
## # A tibble: 40 x 6
##    artist_id profile               artist_name       city  province country
##        <int> <chr>                 <chr>             <chr> <chr>    <chr>  
##  1   6370901 ""                    Small Man         <NA>  <NA>     <NA>   
##  2   4602432 ""                    Pines             <NA>  <NA>     <NA>   
##  3   4892872 ""                    Stuck Pigs        <NA>  <NA>     <NA>   
##  4   4911807 American hardcore-pu… Krimewatch        <NA>  <NA>     <NA>   
##  5   4993817 ""                    Pleather          <NA>  <NA>     <NA>   
##  6   6400672 ""                    Deedee Catpiss A… <NA>  <NA>     <NA>   
##  7   6400673 ""                    Disco Lemonade    <NA>  <NA>     <NA>   
##  8   6111119 ""                    Bad Example       <NA>  <NA>     <NA>   
##  9   5908156 ""                    Payback           <NA>  <NA>     <NA>   
## 10   5621225 ""                    FLEM              <NA>  <NA>     <NA>   
## # … with 30 more rows

and for those whose profile isn’t just "",

artist_data %>%
  filter(is.na(city) & profile != "") %>%
  pull(profile)
##  [1] "American hardcore-punk band based in  New York City."                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     
##  [2] "American band formed in 1979 in Los Angeles, California, who plays a fusion of funk, metal, rock, reggae, ska, punk, soul, ..."                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           
##  [3] "Talking Heads were an American rock band formed in 1975 in New York City and disbanded in 1991. The band was comprised of [a3918] (lead vocals, guitar), [a47110] (drums), [a47111] (bass), and [a126729] (keyboards, guitar). They released eight albums, in addition to several well received live albums.\r\n\r\nInducted into Rock And Roll Hall of Fame in 2002 (Performer).\r\n\r\nGenres: New wave, post-punk, art punk, pop rock, art pop, funk rock\r\n"                                                                                                                                                                                                                                                                                                                                                                                                                         
##  [4] "Formed: 1978 // Corte Madera, CA, United States \r\nMembers:\r\nHuey Lewis (lead vocals, harmonica)\r\nJohnny Colla (saxophone, guitar, vocals)\r\nBill Gibson (percussion, vocals)\r\nJohn Pierce (bass) 1995 - \r\nSean Hopper (keyboards, vocals)\r\nStef Burns (lead guitar, vocals) 2000 - \r\n\"The News Brothers\" 1994 - \r\nMarvin McFadden (trumpet)\r\nRob Sudduth (tenor & baritone saxophone) \r\n\r\nFormer members:\r\nMario Cipollina (bass, vocals) 1979 - 1995\r\nChris Hayes (lead guitar, vocals) 1979 - 2000\r\nRon Stallings (tenor saxophone) 1994 - 2009"                                                                                                                                                                                                                                                                                                         
##  [5] "Olympia, Washington based punk trio Naomi Punk featuring singer/guitarist Travis Benjamin Coster, guitarist Neil Gregerson, and drummer Nic Luempert."                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    
##  [6] "The Smiths were an English rock band, formed in 1982 in Manchester by [url=http://www.discogs.com/artist/92577-Morrissey]Morrissey[/url] and [a37120].\r\nDisbanded in 1987."                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             
##  [7] "American New Wave / Art Pop band formed in Athens, Georgia, USA in 1976.\r\n\r\nKate Pierson (b. 27 April 1948, Weehawken, New Jersey, USA; organ/vocals)\r\nCindy Wilson (b. 28 February 1957, Athens, Georgia, USA; guitar/vocals)\r\nRicky Wilson (b. 19 March 1953, Athens, Georgia, USA d. 12 October 1985; guitar)\r\nFred Schneider (b. 1 July 1951, Newark, New Jersey, USA; keyboards/vocals)\r\nKeith Strickland (b. 26 October 1953, Athens, Georgia, USA; drums. Later switched to guitar after Ricky Wilson's passing.)\r\n\r\n"                                                                                                                                                                                                                                                                                                                                             
##  [8] "From Hattiesburg, Mississippi."                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           
##  [9] "American indie rock band founded in 1998 in Los Angeles and disbanded in 2011."                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           
## [10] "Dir Yassin was an Israeli punk hardcore band formed in December 1997. The original line-up was Kobbi (guitar), Fede (drum), Haim (bass), Adi (vocal), Federico (vocal). \r\nThe name was chosen as an effort to bring the sublimed bloody facts of Zionism and Israel's history back into its collective consciousness."                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  
## [11] "For the band Cat Power with [a254210] as member please use [a6207375]. \r\n\r\nAmerican singer-songwriter, born 21 January 1972 in Atlanta, Georgia, USA."                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                
## [12] "Earth, pioneers of the drone/doom genre, make music that is heavier than heavy, using incredibly downtuned guitars and usually playing no more than a couple riffs per song.  They are based in Seattle, Washington."                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     
## [13] "Popular hip hop duo composed of [a=Jeff Townes] AKA [a=DJ Jazzy Jeff] (b. Jeffrey A. Townes, 22 January 1965, Philadelphia, Pennsylvania, USA) and [a=Will Smith] AKA [a=The Fresh Prince] (b. Willard Christopher Smith Jr., 25 September 1968, Philadelphia, Pennsylvania, USA).\r\nJazzy Jeff started DJing in the mid-1970's when he was a mere 10 years old, (though he is not to be confused with the similarly-titled Jazzy Jeff who cut an album, also for Jive, in 1985). \r\nTogether with [a=Will Smith] , they became the first rap act to receive a Grammy Award for their second LP \"Parents Just Don't Understand\". \r\nMeanwhile, Jeff founded \"A Touch Of Jazz Inc\", a stable of producers working on rap/R&B projects. The duo picked up a second Grammy for \"Summertime\" in 1991, before scoring a shock UK number 1 in 1993 with \"Boom! Shake The Room\". \r\n"
## [14] "Chilean twee pop band that started on 2013."                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              
## [15] "Vancouver, British Columbia\r\nDisbanded in 2015"

Well, it’s because their profile’s don’t follow the format “… band from [location]”! That’s ok, I’m not too proud to manually enter data. I’m going to combine the manually entered locations with the ones that I derived (prioritizing the manual ones to fix a few typos).

manual_locations <- readr::read_csv(here("static", "data", "discogger", "manual_locations.csv"))

artist_data <- artist_data %>%
  left_join(manual_locations, 
            by = "artist_name",
            suffix = c("_discogs", "_manual")) %>%
  mutate(city = coalesce(city_manual, city_discogs),
         city = ifelse(city == "New York City", "New York", city),
         province = coalesce(province_manual, province_discogs),
         country = coalesce(country_manual, country_discogs)) %>%
  select(-ends_with("_discogs"), -ends_with("_manual"), -profile)

My goal (which I just realized I haven’t mentioned yet!) is to get the latitude and longitude of the city that each artist is from, hence wanting the city and, where possible, the province/state and country. So again, I’m using the world.cities data set from maps!

artist_locations <- artist_data %>%
  left_join(world.cities %>%
              select(-pop, -capital),
            by = c("city" = "name"))

Obviously some city names exist more than once in the world. In those cases (where there’s duplicate entries of a band in artist_locations, I’ll keep cases where country and country.etc match or where we don’t have any information on the country.

artist_locations <- artist_locations %>%
  add_count(artist_id) %>%
  filter((n > 1 & (country == country.etc | is.na(country))) | 
           n == 1) %>%
  select(-n)

Let’s see who we still have too much (yet not enough!) information on:

library(janitor)

artist_locations %>%
  get_dupes(artist_id) %>%
  count(city)
## # A tibble: 12 x 2
##    city            n
##    <chr>       <int>
##  1 Birmingham      2
##  2 Boston          6
##  3 Denton          2
##  4 Dublin          2
##  5 Kansas City     6
##  6 Manchester      2
##  7 Portland        3
##  8 Richmond        7
##  9 San Diego       5
## 10 San Jose       48
## 11 Springfield     6
## 12 Washington      6

Most of these are dupes because the city exists in multiple countries, but we don’t have country information on the artist. I’ll put this in manually, then again filter for where country = country.etc

artist_locations <- artist_locations %>%
  mutate(country = case_when(city %in% c("Birmingham", "Boston", "Denton", "San Diego", "San Jose", "Washington", "Portland", "Kansas City", "Springfield", "Richmond") ~ "USA",
                             city == "Dublin" ~ "Ireland",
                             city == "Manchester" ~ "UK",
                             TRUE ~ country)) %>%
  add_count(artist_id) %>%
  filter((n > 1 & (country == country.etc | is.na(country))) | 
           n == 1) %>%
  select(-n)

Are we there yet? 😪

artist_locations %>%
  get_dupes(artist_id)
## # A tibble: 15 x 9
##    artist_id dupe_count artist_name city  province country country.etc
##        <int>      <int> <chr>       <chr> <chr>    <chr>   <chr>      
##  1   4273896          2 Phantom He… Kans… <NA>     USA     USA        
##  2   4273896          2 Phantom He… Kans… <NA>     USA     USA        
##  3   4518577          2 Chemos      Port… Oregon   USA     USA        
##  4   4518577          2 Chemos      Port… Oregon   USA     USA        
##  5   4706419          2 Chasm       Kans… Missouri USA     USA        
##  6   4706419          2 Chasm       Kans… Missouri USA     USA        
##  7   4927139          2 Warm Bodies Kans… Missouri USA     USA        
##  8   4927139          2 Warm Bodies Kans… Missouri USA     USA        
##  9   6370901          5 Small Man   Spri… IL       USA     USA        
## 10   6370901          5 Small Man   Spri… IL       USA     USA        
## 11   6370901          5 Small Man   Spri… IL       USA     USA        
## 12   6370901          5 Small Man   Spri… IL       USA     USA        
## 13   6370901          5 Small Man   Spri… IL       USA     USA        
## 14   6513909          2 Itch        Rich… Virginia USA     USA        
## 15   6513909          2 Itch        Rich… Virginia USA     USA        
## # … with 2 more variables: lat <dbl>, long <dbl>

These last few are city names that are in more than one state in the US. I want Kansas City, MO (not KS), Portland OR, Springfield IL, and Richmond VA. Of course, the state wasn’t in world.cities. I have to go to us.cities for that.

head(us.cities)
##         name country.etc    pop   lat    long capital
## 1 Abilene TX          TX 113888 32.45  -99.74       0
## 2   Akron OH          OH 206634 41.08  -81.52       0
## 3 Alameda CA          CA  70069 37.77 -122.26       0
## 4  Albany GA          GA  75510 31.58  -84.18       0
## 5  Albany NY          NY  93576 42.67  -73.80       2
## 6  Albany OR          OR  45535 44.62 -123.09       0

It needs some cleaning too 🙄

dupe_us_cities <- us.cities %>%
  mutate(name = str_trim(str_replace(name, country.etc, "")),
         keep = (name == "Portland" & country.etc == "OR") | 
           (name == "Kansas City" & country.etc == "MO") | 
           (name == "Springfield" & country.etc == "IL") |
           (name == "Richmond" & country.etc == "VA")) %>%
  filter(keep) %>%
  select(city = name, lat, long)

Now, I’ll just keep the correct cities for artists with duplicate entries.

artist_locations <- artist_locations %>%
  get_dupes(artist_id) %>%
  inner_join(dupe_us_cities) %>%
  bind_rows(
    artist_locations %>%
      add_count(artist_id) %>%
      filter(n == 1) %>%
      select(-n)
  )

artist_locations %>% 
  get_dupes(artist_id)
## # A tibble: 0 x 9
## # … with 9 variables: artist_id <int>, dupe_count <int>,
## #   artist_name <chr>, city <chr>, province <chr>, country <chr>,
## #   country.etc <chr>, lat <dbl>, long <dbl>

😅

artist_locations %>% 
  filter(is.na(lat))
## # A tibble: 0 x 9
## # … with 9 variables: artist_id <int>, dupe_count <int>,
## #   artist_name <chr>, city <chr>, province <chr>, country <chr>,
## #   country.etc <chr>, lat <dbl>, long <dbl>

😅 😅

One last thing, I promise – I want to know whether each vinyl release is a 7" or a 12".

basic_information_tidy <- basic_information_tidy %>%
  mutate(format = case_when(format_name == "Cassette" ~ "Tape",
                                    str_detect(format_description, "LP|12") ~ '12"',
                                    str_detect(format_description, "7") ~ '7"')) %>%
  select(-format_name, -format_description)

Now I can put together my collection with information about the release and the artist!

collection_data <- basic_information_tidy %>%
  unnest() %>%
  left_join(release_data, by = "release_id") %>%
  left_join(artist_locations, by = c("artists_id" = "artist_id")) %>%
  select(release_id, title, format, artist_id = artists_id, artist_name, year, style, city, country = country.etc, lat, long)

collection_data
## # A tibble: 169 x 11
##    release_id title format artist_id artist_name  year style city  country
##         <int> <chr> <chr>      <int> <chr>       <dbl> <chr> <chr> <chr>  
##  1    7496378 Demo  Tape     4619796 Mollot       2015 Hard… Toro… Canada 
##  2    4490852 Obse… "12\""   3192745 Una Bèstia…  2013 Hard… Barc… Spain  
##  3    5556486 Fuck… "12\""   2876549 Good Throb   2014 Post… Lond… UK     
##  4    9827276 I     "7\""    2769828 S.H.I.T.     2017 Hard… Toro… Canada 
##  5    9769203 Oído… "12\""   4282571 Rata Negra   2017 Punk  Madr… Spain  
##  6    7237138 A Ca… "7\""    3596548 Ivy          2015 Hard… New … USA    
##  7   13117042 Tash… "7\""    5211980 Tashme       2019 Hard… Toro… Canada 
##  8    7113575 Demo  Tape     4450861 Desgraciad…  2014 Hard… Calg… Canada 
##  9   10540713 Let … Tape     4273896 Phantom He…  2015 Post… Kans… USA    
## 10   11260950 Sub … Tape     5694086 Sub Space    2017 Hard… New … USA    
## # … with 159 more rows, and 2 more variables: lat <dbl>, long <dbl>

In the words of Elle Woods, “we did it!” If you want to see what I actually do with this data, head over to part three.