-
Notifications
You must be signed in to change notification settings - Fork 0
Expand file tree
/
Copy pathqueue_updates.Rmd
More file actions
174 lines (131 loc) · 5.94 KB
/
Copy pathqueue_updates.Rmd
File metadata and controls
174 lines (131 loc) · 5.94 KB
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
---
title: "Custom Scheduled Updates"
output: github_document
---
## Motivation
The purpose of this analysis is to get an estimate of the percentage of Buffered posts that are shared now, shared to the queue, and scheduled at a custom time. It is in reference to [this issue](https://github.com/bufferapp/data-team/issues/63).
## Methodology
Because there are many users that share many updates through Buffer, we will take a small sample of users that shared updates in Buffer _in the past 7 days_. We will then determine the type of share by looking at the full scope of the `actions_taken` event associated with the update. Then we can run some summary statistics and visualize the breakdown.
## Data collection
It should be a fairly straightforward SQL query to get this data. Let's try it by querying `actions_taken`.
```{r include = FALSE}
library(dplyr); library(buffer); library(ggplot2)
```
```{r include = FALSE}
con <- redshift_connect()
```
```{sql connection=con, output.var=updates}
select
a.user_id
, a.full_scope
, count(distinct a.id)
from actions_taken as a
inner join users
on a.user_id = users.user_id
where full_scope like '%shared now%'
and date > (current_date - 29)
group by a.user_id, a.full_scope
```
```{r}
# Save the data
# save(updates, file = "queue_updates.Rda")
# Load the data
load('queue_updates.Rda')
```
## Determine update types
Let's quickly examine the different values of `full_scope`.
```{r}
updates %>%
group_by(full_scope) %>%
summarise(count = n()) %>%
arrange(desc(count))
```
Hm. It seems like the last two words in the full scope in particular indicate whether the update was shared at a custom time, added to the queue, or shared now.
If the `full_scope` contains "now", the update was shared now. Let's start with that.
```{r}
# Create new column
updates$type <- character(nrow(updates))
# Get index of updates that contain 'now' in full_scope
now_index <- grep("now", updates$full_scope, ignore.case = T)
# Set the type of these rows to 'now'
updates$type[now_index] = "now"
```
Cool, now let's do the same for the word "schedule"
```{r}
# Get index of updates that contain 'schedule' in full_scope
schedule_index <- grep("schedule", updates$full_scope, ignore.case = T)
# Set the type of these rows to 'custom_schedule'
updates$type[schedule_index] = "custom_schedule"
```
Now let's make an assumption: the rest of the updates were sent to the queue (in whichever slot).
```{r}
# Set remaining types to 'queue'
updates[updates$type == "", ]$type <- 'queue'
```
Great! I think we're ready now.
## Exploratory analysis
Let's look at the overall percent of updates by type.
```{r}
# Group by update type
updates %>%
group_by(type) %>%
summarise(users = n_distinct(user_id), updates = sum(count)) %>%
mutate(updates_percent = updates / sum(updates)) %>%
arrange(desc(updates_percent))
```
Cool, so around 51% of all updates sent in the past 7 days were sent to the queue, around 37% were sent via a custom schedule, and around 12% were sent now.
**_This might not tell the whole story however, as these percentages might be disproportionately influenced by users that send large numbers of updates._** It might be worth getting this breakdown for _each_ user, and then analyzing the distribution of that breakdown.
### Get breakdown per user
We want to group the updates by user and type of update. We first need to tidy the data a bit and make sure each user has a value for each type. Let's also only look at users that scheduled at least 5 updates
```{r}
# Group by user
by_user <- updates %>%
group_by(user_id, type) %>%
summarise(updates = sum(count)) %>%
filter(updates >= 5)
```
Now we need to make sure the users have values for all three types of updats.
```{r}
library(tidyr)
# Spread the update types
spread <- by_user %>%
spread(type, updates)
# Replace the new NAs with 0
spread[is.na(spread)] <- 0
# Now gather the updates into tidy format again
long <- spread %>%
gather(type, updates, custom_schedule:queue)
```
There we go! Now for each user, we can calculate the proportion of updates for each type.
```{r}
# Calculate proportions for each user
proportions <- long %>%
group_by(user_id, type) %>%
summarise(updates = updates) %>%
mutate(proportion = updates / sum(updates))
```
Now let's look at the plots of the distributions.
```{r warning = FALSE, message = FALSE, echo = FALSE}
library(gridExtra)
# Custom plot
custom_plot <- ggplot(filter(proportions, type == 'custom_schedule'), aes(x = proportion)) +
geom_density() +
theme(axis.text.y=element_blank(), axis.ticks.y=element_blank()) +
labs(x = 'Proportion of Updates Custom Scheduled')
# Queue plot
queue_plot <- ggplot(filter(proportions, type == 'queue'), aes(x = proportion)) +
geom_density() +
theme(axis.text.y=element_blank(), axis.ticks.y=element_blank()) +
labs(x = 'Proportion of Updates Sent to Queue')
# Now plot
now_plot <- ggplot(filter(proportions, type == 'now'), aes(x = proportion)) +
geom_density() +
theme(axis.text.y=element_blank(), axis.ticks.y=element_blank()) +
labs(x = 'Proportion of Updates Shared Now')
grid.arrange(custom_plot, queue_plot, now_plot, nrow = 2)
```
Interesting, for both custom-scheduled posts and posts sent to the queue, users either did it a lot, or not very much. For custom-scheduled posts, most users only custom scheduled for a very small percentage of updates. However, some users sent a very high proportion of posts via a custom schedule.
For queue updates, it's a similar story, except _more_ users sent the vast majority of updates to the queue. :)
## Conclusions
Most posts are sent directly to the queue, but a high proportion are also sent via a custom-schedule. The distribution doesn't seem to be even for users - either they send most directly to the queue or they send a low percentage of posts directly to the queue.
Most users send a small percentage of their posts via a custom schedule, but there are also users that send a high proportion of updates via a custom schedule. :)