Sharla Gelfand

RStudio Conf 2018: I didn't lose my wallet or my keys

RStudio Conf 2018 is done and gone, but I wanted to write down some thoughts before they are forgotten! The conference is packed into an incredible 2 days (4 if you workshopped before) – I attended two keynotes, two fireside chats, and 20 (!) talks. These 20 talks were just a third (!!) of the talks that happened. I asked one question, and got very distracted by hearing the sound of my own voice come from the PA midway through.

If you follow me on twitter (how else would you find this blog?), you might know that I was also very active on the #rstudioconf hashtag.

But “very active” isn’t a number, and given all the talk about crucial data science skills (like counting and division) over the last few days, I wanted some numbers. I’m using Mike Kearney’s fantastic rtweet package (as always!) to pull my own tweets over the last few days (the Thursday to Sunday, aka when I flew in and out of San Diego ✈️).

library(rtweet)
library(dplyr)
library(lubridate)

# Pull my most recent tweets and shift the timestamps to San Diego time
my_tweets <- get_timeline(user = "sharlagelfand", n = 400) %>%
  mutate(created_at = with_tz(created_at, tz = "America/Los_Angeles"))

# Keep only tweets from the conference days
my_tweets_conf <- my_tweets %>%
  filter(created_at >= "2018-02-02" & created_at <= "2018-02-05")

# Count original tweets versus retweets
my_tweets_rt <- my_tweets_conf %>% 
  group_by(is_retweet) %>% 
  count()

my_tweets_rt
## # A tibble: 2 x 2
## # Groups: is_retweet [2]
##   is_retweet     n
##   <lgl>      <int>
## 1 F             91
## 2 T             37

If we look at these, we can see that I tweeted a total of 128 times – 91 original tweets, and 37 retweets. For these original tweets, I got 672 likes (!!!) and 92 retweets. I take no credit for the response, of course – I’d rather think about this as people engaging with the speakers’ content, rather than with me; the work that went into all those incredible talks was a lot more than the work that went into the tweets 💁
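
For the curious, here’s roughly how those numbers fall out of the same data – a sketch relying on the favorite_count and retweet_count columns that come back with get_timeline(), counting only my original (non-retweet) tweets:

# Total likes and retweets on my original tweets over the conference days
my_tweets_conf %>%
  filter(!is_retweet) %>%
  summarise(
    likes = sum(favorite_count),
    retweets = sum(retweet_count)
  )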

I don’t want to go into a full blown analysis of the tweets, because 1) it’s not what’s important here, and 2) I am extremely tired.

Instead, I wanted to give an overview of a few of my favourite talks, and some general themes that I took away from the conf. I did miss a lot of talks (by design – Hadley himself said their goal was to make deciding between talks as difficult as possible; I ended up missing his, so… mission accomplished), and am looking forward to watching the recordings once they’re available and reading other recaps!

You will quickly be able to tell that I opted for most of the process/industry talks (case-study-style), so there is not a lot of code or tips n tricks here – I didn’t attend anything in the shiny, packages, or programming tracks, so… again, looking forward to other recaps 😋

I’m linking to slides when I can find them, but will update as I find more!

Davis Vaughan: The future of time series and financial analysis in the tidyverse

Davis talked about the tibbletime package, which is “an extension that allows for the creation of time aware tibbles through the setting of a time index.” Slides are here.

As someone who works with a lot of time data, I was so thrilled watching this talk. It is the package I didn’t even realize I needed – I had settled with some of the friction of working with time data (lubridate and Edwin Thoen’s padr package have helped so much) and figured that was that. The tibbletime package is literally just so smart.

I’m particularly excited by easier time-based subsetting (e.g. filtering the data to be between two dates/times) through use of the filter_time() function.

Once you set the time index of the tibble (via df2 <- as_tbl_time(df, index = date)), you can easily subset by doing df2 %>% filter_time(from ~ to), as opposed to the traditional df %>% filter(date >= from & date <= to).

It’s smart enough to take years, months, days, seconds, and combinations thereof in the from and to arguments. It may seem like a little thing, but I can already see how this mental model will remove so much friction in my day-to-day.

For example, instead of the code above I could do

library(tibbletime)

# Set created_at as the time index, then subset with interval notation
my_tweets_conf_tibbletime <- my_tweets %>%
  as_tbl_time(index = created_at) %>%
  arrange(created_at) %>%
  filter_time("2018-02-02" ~ "2018-02-04")

nrow(my_tweets_conf_tibbletime) == nrow(my_tweets_conf)
## [1] TRUE

and get the exact same data, using interval notation.

You might notice that I used 2018-02-05 as the end date in the dplyr-only version of the code, but 2018-02-04 in the tibbletime version – tibbletime goes to the end of the day, whereas dplyr would treat 2018-02-04 as 2018-02-04 00:00:00, so any records on that date wouldn’t be included – and I wanted them! It’s fine that dplyr does this, of course, and just something you have to remember – after all, it is comparing date-times literally. But I do often forget, so this “end of the day” business is great for me!
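
Since filter_time() takes finer-grained bounds too, you can slice within a day – a quick sketch I haven’t actually run, with the timestamps picked purely for illustration:

# filter_time() also accepts full date-time strings, e.g. a morning's worth of tweets
my_tweets_conf_tibbletime %>%
  filter_time("2018-02-02 09:00:00" ~ "2018-02-02 12:00:00")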

There’s also the ability to roll dates up to a different interval (e.g., roll every record in a day up to that date) via collapse_by(), which could be used for grouping by that interval to summarise the data. This is where the biggest overlap between tibbletime and padr is (the padr::thicken() function is similar) – there was some discussion among Davis, Edwin, and me on this on twitter if you’re curious.
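
Here’s a rough sketch of what that could look like with the conference tweets – I’m assuming "daily" is an accepted period spec (tibbletime is pretty flexible about these), so treat it as illustrative rather than gospel:

# Roll each tweet's timestamp up to its day, then count tweets per day
# ("daily" as the period spec is my assumption here)
my_tweets_conf_tibbletime %>%
  collapse_by(period = "daily") %>%
  group_by(created_at) %>%
  summarise(tweets = n())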

There is a ton more functionality that I won’t cover – all in all, I am so excited to put this package to use, and so grateful for the work that Davis + Business Science, and Edwin have done to make time series data easier to work with.

JD Long: The unreasonable effectiveness of empathy

JD’s slides aren’t available yet since it looks like he’s on vacation somewhere fun, so I’m going off memory – but his talk centred around empathy and its power when creating data products. I sometimes find it difficult (and I doubt I’m alone!) to get all the context and information I need when doing an analysis. Turns out that a great way to get that context and empathy is by borrowing from agile, and using user stories! These take the form “As an X, I want to Y, so I can Z.”

For example,

As an artist, I want to be able to save pictures on Instagram, so I can use them later for inspiration.

This is a lot different than just saying “let me download pictures” – screenshots already exist, right? I have 0 affiliation with instagram (though I am an admirer of RStudio’s) but quite like how they added in the Saved feature with different collections so different pictures can go in different places.

Doing this helps put you in the end user’s shoes – you understand who they are, what they want to do, and why they want to do it; what impact it’ll have and how it’ll help them. This is often 💯 times more useful than someone telling you what they want – maybe they don’t know exactly what that is – but most importantly (I think, and probably JD too!), it helps you empathize, fully understand their goals, and create products accordingly.

JD’s theme overall was that thinking about who is on the receiving end of your work is really important, and to be specific – as he put it, “when you design for everyone, you design for no one.”

Elaine McVey: Agile data science

Elaine, of TransLoc, talked about common pains that come with doing data science in a company and how to draw from agile to help. The issues are:

  • Marginality: people think (or it’s true) that the data work being done is not the core work of the company.
  • Mystery: people don’t know what data science is, or what data scientists do.
  • Misalignment: data science groups are often poorly placed (in the wrong department, as support teams, or without influence).

For marginality, she proposes user stories (familiar from above!). The goal of this is to develop a shared understanding and ensure that the work data scientists are doing is related to the core work of the company.

For mystery, she suggests vertical slicing – get to the smallest possible end result as quickly as possible, and check back in. Instead of running 10,000 simulations, run two, and present the results. See the reaction and get a better understanding of the value of the work, and a holistic understanding of what projects are valuable and what aren’t.

Finally, for misalignment she touches on stakeholder reviews to discuss competing priorities. All stakeholders should review vertical slices and see what’s coming up. This ensures that all stakeholders understand what data scientists do and the impact of their work, and forces (my words, not hers) them to prioritize amongst each other for data science resources. You ever try to be the one doing the data science and the one prioritizing it? Yeah 🙃

I have definitely experienced all three pains doing data science within a company, and look forward to figuring out how to apply these approaches to my own work!

Elaine has a blog where she outlines the pains and the solutions.

Emily Riederer: tidycf: turning analysis on its head by turning cashflows on their sides

Emily started off her talk by outlining how important cashflow analysis is to Capital One’s business analytics – and yet, so much of it has been done in different tools (spreadsheets, powerpoint, word processors, databases, BI tools, etc.) with poor documentation and reproducibility.

Turns out that these complicated analyses actually follow a familiar analytical model that many of us are used to (import –> tidy –> explore –> communicate, you know the one), and that “special” business analyses aren’t as special as we think – a lot can be borrowed from tidy data + reproducibility principles! Based on this, her team built the internal tidycf R package.

Something that I loved was that Emily focused so much of her talk around the process of building this package and integrating it into peoples’ workflows. They focused on empathy (designed to meet users’ needs – there it is again!), empowerment (designed to teach and facilitate), and engagement (designed for extension with invitation to contribute).

Her team also used RMarkdown templates as a way to teach the R code and enable knowledge-transfer on the domain (a theme I saw discussed in a few other talks), with additional focus on opinionated project templates and directories. A big theme here was around setting her colleagues up for success with using R, while emphasizing reproducibility.
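
I have no idea what tidycf’s templates actually look like, but for anyone curious about the mechanics: a package-level R Markdown template is just a skeleton document plus a bit of YAML under inst/, which users then pull in with rmarkdown::draft(). A sketch with completely made-up template names:

# Inside the package, templates live under inst/ (hypothetical layout):
#   inst/rmarkdown/templates/cashflow-analysis/template.yaml
#   inst/rmarkdown/templates/cashflow-analysis/skeleton/skeleton.Rmd

# A user could then start a new, pre-structured analysis with:
rmarkdown::draft("my-analysis.Rmd", template = "cashflow-analysis", package = "tidycf")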

Sandy Griffith: Accelerating cancer research with R

Sandy, of Flatiron Health, talked about choosing R over SAS for her team (a theme echoed in another talk, by Beth Atkinson of Mayo Clinic) and the steps that went into that choice, then cultivating and sustaining a strong R team. Cultivating started with a lot of support – an internal R package, user group, Slack channels, training, and hiring. She then focused on sustaining – once everyone is proficient, there’s a need to focus on consistency and contribution, via growing internal packages, and focusing on reproducibility, quality control (I swooned at the mention of unit tests!), and standardization.

She acknowledged that there are challenges now – devoting time to infra, internal package management, and coordinating R usage outside of the Quantitative Sciences team among them, all areas of potential improvement for a company that is quite mature in R.

Sandy is biking through the desert post conf, but I will link her slides (and add more detail on the talk if they jog my memory – I can tell I’m forgetting a lot) once/if they’re available!

Tanya Cashorali: Rapid prototyping data products using Shiny

Tanya has her own consulting business, TCB Analytics, and her talk was about getting shit done, fast, in Shiny (my preferred title; not the real one). She had concrete examples of case studies, and how she worked towards quick solutions, but my favourite part of the talk was all her advice along the way – on imposter syndrome and on getting shit done (she pulled a lot from the Done manifesto – that site will ask you to whitelist it, I’m sorry, you don’t have to and it still works).

I loved that Tanya pointed out that 1) if you don’t have any imposter syndrome, you probably know less than you think you do and 2) that not doing anything is pretty much the worst thing you can do. You don’t have to do things perfectly – get shit done, and iterate later.

This was all really helpful to me – I definitely struggle with imposter syndrome and trying to make things perfect (which always slows me down), and I quite like shiny… so… 👍🏾

Kayla Patel: Imagine Boston 2030: using R-Shiny to keep ourselves accountable and empower the public

Kayla works for the city of Boston and talked about a Shiny dashboard she developed for Imagine Boston 2030, a long-term plan that emphasized public input and data to develop 5 major goals for the city, each with their own set of metrics. The dashboard allows users to see the current state for each metric, historical trends, progress, and the city’s plans for each goal.

I loved so many things about this talk – primarily the fact that her team thought so much about how to visualize and communicate data. The IB2030 plan had so much contribution from the public, and the goal of the dashboard is to communicate the past and current states of each metric, along with how the city plans to improve them; it doesn’t make any sense to have data visualizations that are unclear and inaccessible, and they did a ton of work to ensure that wasn’t the case.

Of course, it was awesome to see how open source tools (tidyverse, shinydashboard, plotly, and leaflet among them) contributed to this work, and the iterations involved in making the code cleaner and easier to read (TODO: learn shiny modules).
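
For anyone else with “learn shiny modules” on their list: a module is just a UI function that namespaces its IDs plus a server function you wire up with callModule(). A bare-bones sketch (nothing to do with Kayla’s actual dashboard – all names made up):

library(shiny)

# UI half of a made-up "metric trend" module – NS() namespaces the output ID
metricTrendUI <- function(id) {
  ns <- NS(id)
  plotOutput(ns("trend"))
}

# Server half – takes some data and renders the plot
metricTrend <- function(input, output, session, metric_data) {
  output$trend <- renderPlot(plot(metric_data, type = "l"))
}

ui <- fluidPage(metricTrendUI("population"))

server <- function(input, output, session) {
  # callModule() ties the server half to the "population" namespace
  callModule(metricTrend, "population", metric_data = cumsum(rnorm(100)))
}

shinyApp(ui, server)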

I also learned that the city of Boston has a pretty cool open data portal and initiative!

You can read more about the dashboard here, and see it here.

Wrapup

Honorable mention goes to Julia Silge, who taught me more about PCA in 20 minutes than two degrees and 6 years of statistics education did (no offence) – my feelings are summarised by that much. You can see her slides here.

There are a couple of repos collecting all the talks, here and here.

Overall, this was such an amazing conference. I had so much fun meeting and seeing friends from R-Ladies and Twitter; it truly felt like I had friends all around me, which is hard to accomplish with 1100 strangers around. Until next year in Austin!