---
title: "Underrated Tidyverse Functions"
author: "Ted Laderas"
date: "12/1/2020"
output:
  html_document:
    toc: true
    toc_float: true
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(tidyverse)

tweets <- read.csv("tweets_long.csv")
DT::datatable(tweets)
```

# The Assignment

I'm teaching an R Programming course next term. Jessica Minnier and I are developing the [Ready for R Materials](https://ready4r.netlify.app/labbook) into a longer and more involved course.

I think one of the most important things we can teach is how to self-learn. Learning to program is a lifelong activity, so it's critically important to give students these meta-learning skills. That's the motivation behind the *Tidyverse Function of the Week* assignment. I asked on Twitter:

Hi Everyone. I'm teaching an #rstats course next quarter.

One assignment is to have each student write about a #tidyverse function. What it's for and an example.

What are some less known #tidyverse functions that do a job you find useful?

— Ted Laderas, PhD 🏳️‍🌈 (@tladeras) November 30, 2020
## Some of my favorite suggestions

Here are some of the highlights from the thread. I loved all of these. Danielle Quinn wins the MVP award for naming so many useful functions:

dplyr::uncount()
tidyr::complete()
tidyr::fill() / replace_na()
stringr::str_detect() / str_which()
lubridate::ymd_hms() and related functions
ggplot2::labs() - so simple, yet under appreciated!

— Danielle Quinn (she/her) (@daniellequinn88) December 1, 2020
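One quick note: `uncount()` actually lives in tidyr rather than dplyr. Here's a minimal sketch of a few of these in action, using small made-up tibbles (the data and column names are just for illustration):

```{r}
# A made-up frequency table: uncount() expands it to one row per observation
counts <- tibble(fruit = c("apple", "banana"), n = c(2, 3))
counts %>% uncount(n)

# complete() makes implicit missing combinations explicit (as NA rows)
sales <- tibble(year = c(2019, 2019, 2020),
                item = c("a", "b", "a"),
                amount = c(10, 20, 15))
sales %>% complete(year, item)

# str_detect() returns a logical vector of pattern matches
str_detect(c("tidyr::fill()", "base R"), "tidyr")
```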
`fill()` was highly suggested:

tidyr::fill() - extremely useful when creating a usable dataset out of a spreadsheet originally built for data entry, in which redundant information is only reported once at the beginning of the group it refers to, rather than in every row as needed for the analysis.

— Luca Foppoli (@foppoli_luca) December 1, 2020
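Here's a minimal sketch of the data-entry pattern Luca describes, with a hypothetical spreadsheet where the group label is typed only once per group:

```{r}
# Hypothetical data-entry sheet: the site label is only filled in
# at the start of each group
entries <- tribble(
  ~site,   ~measurement,
  "North", 1.2,
  NA,      1.5,
  NA,      1.1,
  "South", 2.4,
  NA,      2.2
)

# fill() carries the last non-missing value downward by default
entries %>% fill(site)
```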
Many people suggested the window functions, including `lead()` and `lag()` and the cumulative functions:

Check out the dplyr window functions, cummin, cummax, cumany and cumall. They don't seem useful at first but they can solve really tricky aggregation problems. https://t.co/aDpXqSB2Vx

— Robert Kubinec (@rmkubinec) December 1, 2020
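For the curious, here's a small sketch on a made-up daily series showing `lag()` alongside the cumulative functions:

```{r}
daily <- tibble(day = 1:5, value = c(3, 5, 4, 8, 6))

daily %>%
  mutate(
    change      = value - lag(value), # difference from the previous row
    running_max = cummax(value),      # highest value seen so far
    above_4_yet = cumany(value > 4)   # TRUE from the first row exceeding 4 onward
  )
```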
Alison Hill suggested `problems()`, which helps you diagnose why your data isn't loading:

Ooh problems is a good function for importing rx https://t.co/P4ZR57PgOG

— Alison Presmanes Hill (@apreshill) December 1, 2020
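`problems()` comes from readr. A minimal sketch with made-up inline data, where one value can't be parsed as an integer (the `I()` wrapper marks literal data in readr >= 2.0):

```{r}
df <- read_csv(I("x\n1\noops\n3"), col_types = "i")

# problems() reports the row, column, expected type, and actual value
problems(df)
```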
I think that `deframe()` and `enframe()` are really exciting, since I do this operation all the time:

tibble::enframe(), tibble::deframe()
coercing a two-column df to named vector, which I prefer immensely to names(df) <- vec_of_names

— E. David Aja (@PeeltothePithy) December 1, 2020
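Here's a quick sketch of the round trip between a named vector and a two-column tibble:

```{r}
scores <- c(alice = 90, bob = 85)

# enframe(): named vector -> two-column tibble
tbl <- enframe(scores, name = "student", value = "score")
tbl

# deframe(): two-column tibble -> named vector
deframe(tbl)
```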
`unite()`, `separate()` and `separate_rows()` also had their own contingent:

I find myself using tidyr::unite() a lot to clean messy data - particularly useful for making unique and informative IDs for each row. coalesce() and fill() are also little known gems! :)

— Guy Sutton 🌾🇿🇦🇿🇼 (@Guy_F_Sutton) December 1, 2020
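Here's a minimal sketch combining all three on some hypothetical messy records (the column names and values are invented for illustration):

```{r}
records <- tibble(
  site  = c("A", "B"),
  plot  = c(1, 2),
  taxa  = c("ant, bee", "wasp"),
  email = c(NA, "b@x.org"),
  phone = c("555-1234", NA)
)

records %>%
  unite("id", site, plot, sep = "-") %>%   # unique, informative row IDs
  separate_rows(taxa, sep = ", ") %>%      # one row per listed taxon
  mutate(contact = coalesce(email, phone)) # first non-missing contact field
```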
## Wow! Let's Grab All the Tweets and Replies

I was bowled over by all of the replies. This turned out to be an unexpectedly fun thread, with lots of recommendations from others. I thought I would try to summarize everyone's suggestions and compile a list of recommended functions.

I used [this script](https://github.com/gmellini/twitter-scraper) with some modifications to pull all the replies to my tweet. In particular, I had to request the `extended` tweet mode, and I extracted a few more fields from the returned JSON. This wrote the tweet information into a CSV file.

Then I started parsing the data. I wrote a couple of functions: `remove_users_from_text()`, which removes user mentions from a tweet (by looking for words that begin with `@`), and `get_funcs()`, which uses a relatively simple regular expression to extract the function names (it looks for paired parentheses `()` or an underscore `_`). It actually works pretty well, and grabs most of the functions. Then I use `separate_rows()` to split multiple functions into separate rows, which makes it easier to tally them all.

```{r}
# Strip @user mentions from the tweet text
remove_users_from_text <- function(col){
  str_replace_all(col, "\\@\\w*", "")
}

# Pull out likely function names: words ending in () or containing an underscore
get_funcs <- function(col){
  out <- str_extract_all(col, "\\w*\\(\\)|\\w*_\\w*")
  paste(out[[1]], collapse = ", ")
}

parsed_tweets <- tweets %>%
  rowwise() %>%
  mutate(text = remove_users_from_text(text)) %>%
  mutate(funcs = get_funcs(text)) %>%
  ungroup() %>%
  separate_rows(funcs, sep = ", ") %>%
  select(date, user, funcs, text, reply, parent_thread) %>%
  distinct()

write_csv(parsed_tweets, file = "cleaned_tweets_incomplete.csv")

knitr::kable(parsed_tweets[1:10, -c(5:6)])
```

At this point, I realized that I just needed to hand-annotate the rest of the tweets, rather than wasting my time trying to parse the remaining cases. So I pulled everything into Excel and annotated the ones the regular expression couldn't handle.

## Functions by frequency

Here are the function suggestions by frequency. Unsurprisingly, `case_when()` (which I cover in the main course) received the most suggestions, because it's so useful. `tidyr::pivot_wider()` and `tidyr::pivot_longer()` are also covered in the course. There are some others that were new to me, and a bit of a surprise, such as `coalesce()` and `fill()`.

```{r}
# Turn the user column into a markdown link to the reply tweet
cleaned_tweets <- read_csv("cleaned_tweets.csv") %>%
  select(-parent_thread) %>%
  mutate(user = paste0("[", user, "](", reply, ")")) %>%
  select(-reply)

functions_by_freq <- cleaned_tweets %>%
  janitor::tabyl(funcs) %>%
  filter(!is.na(funcs)) %>%
  arrange(desc(n))

write_csv(functions_by_freq, "functions_by_frequency.csv")

functions_by_freq %>%
  knitr::kable()
```

## Cleaned Tweets and Threads

Here are all of the tweets from this thread (naysayers included). They are in rough order (longer threads are grouped together). Here's a [link to the cleaned CSV file](cleaned_tweets.csv).

```{r message=FALSE}
knitr::kable(cleaned_tweets)
```

## Source Code and Data

Feel free to use and modify.

- [RMarkdown file](index.Rmd) used to generate this post
- [Python Twitter Scraper (by Giovanni Mellini)](https://github.com/gmellini/twitter-scraper) - I used this because there wasn't a ready-made recipe in `rtweet` to extract replies - you have to use recursion to extract all of the thread replies that belong to a tweet, and this script was easily modifiable
- [Cleaned Tweets File (CSV)](cleaned_tweets.csv)

## Thank You

This post is my thank you to everyone who contributed to this thread. Thank you!