Working for a deep-pocketed digital publishing innovator doesn’t mean you don’t have to solve the same frustrating reporting problems as smaller companies and agencies. On the other side of that coin, sometimes a simple analytics solution is best — even for a large enterprise.
In this article, we walk you through automating an attribution report using R Shiny and Google Cloud Computing. We provide simple and practical relief to diligent analysts stuck in the office late on Friday afternoons, manually wrangling data in and out of spreadsheets to make dashboards in time for distribution first thing Monday morning.
We salute you, keepers of the Friday night flame, crusaders of the pivot table, champions of the VLOOKUP() and 8-layer IF() formulas. We want to tell your story and provide some help.
A web analytics client’s marketers found themselves in this situation until some simple automation helped them reclaim their weekends and focus on more interesting tasks during the week. They functioned as a fledgling startup deep within the walls of a financial publisher, responsible for growing international subscriptions. However, they were also tasked with churning out routine reports which required a disproportionate level of manual extraction, transformation, and loading (ETL) in order to populate what were, in the end, relatively simple dashboards.
The root of the problem certainly wasn’t lack of a robust marketing stack of enterprise-grade resources like Google Analytics 360 and powerful Business Intelligence (BI) platforms. The real obstacle was a common one — all the attention and resources were focused on new product initiatives, leaving little time and focus for minding the analytics engine room and keeping up with growing demands for content performance reports from the team’s editorial and business stakeholders.
Specifically, publishers in each market needed regular reports on these important but otherwise routine media analytics questions on a weekly basis:
- What were my best performing articles by publish date?
- What is the gap between the most popular vs. most engaging articles?
- What articles and authors drive the most conversions and revenue?
- What articles that are viewed FIRST in a session drive the most conversions and revenue ?
- How are email signups/conversions performing over time and by source/medium across all of my sites?
To produce these reports, the analytics team aggregated article metadata like publication date, author, and title in Google Analytics Custom Dimensions at both hit and session scopes across multiple websites and GA Properties. Further, as is often the case with inherited analytics accounts, each legacy GA Property had been set up differently and was populating values into different Custom Dimension slots. Needless to say, there was a lot of joining going on across multiple fields.
Further complicating things, all of that tedious, manually-joined data was ending up in spreadsheets for analysis and reporting. I say “all” but the reality is that only a few weeks worth of data was being analyzed at any given time due to limitations in the number of rows allowed in an Excel spreadsheet.
When we hear that data scientists devote 80% of their time to simply collecting and preparing data before any real analysis begins, we remember situations like this are all too common. The solution outlined below solves these challenges and can be easily adapted to fit other complex requirements.
An Elegant, Affordable Solution
So, how does a busy team on a fixed budget without a dedicated resource deliver volumes of daily and weekly reports to stakeholders and still have time for delivering insights needed by management?
Here’s how we did it, and how you can too with just a little bit of tweaking for your particular setup. We promise this isn’t hard. If it starts to feel too technical, get a little help from your analytics team or give us a call.
Empirical Path put together a simple cloud-hosted dashboard solution using a short stack of affordable and open source technology. Using Google Cloud Compute Engine VM instance, some R Shiny, Flexdashboard, and Mailgun kept report consumers supplied with elegant, interactive dashboards delivered on-demand and weekly via recurring email summaries.
The solution is affordable and portable:
- For 5-10 simultaneous report users, the application costs around $25/month.
- Easy portability and scalability means it adapts to changing requirements
- Open source technology means security and loads of community-contributed help.
- It works consistently with Google Analytics Standard or the Analytics 360 Suite.
- It lives locally or in a Google Cloud Platform (GCP) project.
- Up to 10,000 R Markdown reports can be emailed on a schedule each month for free.
Let’s Build It
Let’s walk through the build-host-automate process step-by-step. Note, we’re skipping right to the fun part here. All of this assumes you have already done your homework — defining your KPIs and auditing your data collection.
Step 1: Make Your Mockup for Real Users
Gather your end-user requirements and sketch up a dashboard wireframe, identifying components and layout. We always recommend running it past a few actual human beings — ideally report consumers — to ensure you’re building for the real world.
Here’s what ours looked like:
It’s an optional pro tip, but unless you are doing this as a personal project, create a spec sheet, noting decisions and details that went into your dashboard design, so the next person that comes along can pick up where you left off instead of starting all over again.
Step 2: Write Your Data Collection Code
Code is needed to perform data collection and extraction. For this, you’ll need a bit of working knowledge of R to:
- Setup R on your local machine
- Install Git on your local machine (and one of the free GUI’s like GitHub Desktop client or SourceTree)
- Write the following (or download ours from the GitHub repo):
- functions.R – the engine
- extract.R – the ignition, pedals and steering wheel
- Copy our script below and tweak it by doing the following:
- Add your GA View IDs to `extract.R` file in the `list_of_views_to_get` list
- Create an equivalent lookup table for each GA View added to map the custom dimension index to the pretty name in the `config.R` file
We wrote one script that could take a list of GA View IDs, grab both hit and session-level data from GA and save the results separately for each dashboard.
- Serves as a backup to ensure the dashboards always receive unsampled data.
- Provides unsampled data via manual exports via web UI or GA Core Reporting API for ad-hoc analysis.
Before moving on, validate your script is working the way you want it by running your code, then comparing results to the GA web UI. Once you are getting the data you want and can verify it’s accurate (i.e., unsampled), proceed to building out your dashboard interface, controls, and email distribution.
Step 3: Building the Dashboard Interface
Build your front end — your dashboard code, summary reports, and email templates.
For each GA View used in the ETL code above, add a line to each of the following variables (e.g., all the `switch()` functions in the `index.R` file for the following so the report displays the correct values):
Each dashboard is it’s own script (RMarkdown file), and due to some nifty R code writing, we were able to create a template structure, which worked great for this team’s requirements. And made it easy for us to make updates to all 5 dashboards virtually simultaneously by just copy/pasting the files across and changing a single parameter: the GA View ID.
Here’s an illustration of the directory structure we used (note: imagine `website1` = `demo` folder when looking at our Git repo):
│ ├── website1
│ │ ├── index.Rmd
│ │ ├── email.Rmd
│ ├── website2
│ │ ├── index.Rmd
│ │ ├── email.Rmd
│ ├── website3
│ │ ├── index.Rmd
│ │ ├── email.Rmd
│ ├── README.md
│ ├── gaAccountInfo_demo.rds
│ ├── top25_demo.rds
│ └── viewId_demo.rds
│ └── functions.R
We also leveraged the R Flexdashboard package since it offered out-of-the box features for the dashboard layout and UI. This allowed us to save time and focus on the meat of the backend data ETL.
Here’s what we ended up with. You can try a live demo of this dashboard.
Step 4: Hosting
Once you’re satisfied with the dashboard in your local environment, it’s time to move it to the cloud. There are plenty of ways to host your application, but as a Google Cloud Platform Partner, we were happiest with the following setup.
- Push code to git repository (we use GitHub, but BitBucket works too)
- Setup your GCP Project, Billing enabled
- Setup GCE Firewall Rules (to allow allow all)
- Create and start a GCE VM (< $25/mo)
- Size: 1 vCPU & 3.75GB RAM
- OS: Ubuntu 16.10 (GNU/Linux 4.8.0-46-generic x86_64)
- Configure GCE VM – Install OS dependencies (git, etc), install R, Shiny Server, all R packages
- Clone Git repo on instance in a folder under the directory “/srv/shiny-server” via:
sudo git clone https://github.com/username/reponame
- Run the R code to extract the data. For example:
$ sudo Rscript /srv/shiny-server/gce-rshiny-ga-article-attribution/dashboards/extract.R
- Go to the url “123.456.78:3838/gce-rshiny-ga-article-attribution/dashboards/” (where 123.457.78 = your VM’s IP as found in Google Dev Console > Google Compute Engine)
Here are some other hosting options for hosting and deploying Shiny apps that cost more, but don’t require as much lifting during this step:
Step 5: Automating Data Refreshes
Once everything is hosted and ready to go, you have several options. For more details, see this post: 4 Ways to schedule R scripts in the Google Cloud.
We chose plain old cron jobs and set up the following on the instance:
Shell into the instance via (without the $!)
$ gcloud compute --project "project-id" ssh --zone "us-east1-b" "instance-name"
…then access the crontab file:
$ sudo crontab -e
Then copy/paste this line:
0 10 * * * cd /srv/shiny-server/dashboards; Rscript extract.R
This runs the script once a day at 1000 UTC (or 0600 EDT)
Step 6: Set Up Automated Emails
Once we built the dashboards, we set up email delivery using these free or inexpensive resources:
- A subdomain via Google Domains ($9.99/year)
- Mailgun (free up to 10,000 emails/month)
If you’re new to Mailgun, here are a few links you might find helpful for setting this up.
- Create free Mailgun account
- Add a domain to Mailgun
- Verify a domain
- Set up a mailing list called “Weekly Reports”
And, for reference, here is Mark Edmonson’s suggested code snippet, which we adapted for our purposes.
We won’t get too far into the weeds here, but with a bit more knowledge of devops tools, you can really start optimizing your analytics and reporting workflows and create easy-to-clone environments for limitless possibilities.
Wouldn’t you rather spend your weekend playing around with RStudio than creating manual reports in Excel? We think so.
Here are a few links to get you started.
- R – https://www.r-project.org/
- RStudio – https://www.rstudio.com/
- Shiny – https://shiny.rstudio.com/
- Shiny Server – https://www.rstudio.com/products/shiny/shiny-server/
- Docker – https://www.docker.com/
- Rocker/Shiny – https://hub.docker.com/r/rocker/shiny/
- googleComputeEngineR – https://cloudyr.github.io/googleComputeEngineR/articles/docker.html
- googleAuthR – http://code.markedmondson.me/googleAuthR/
The benefits of adopting this workflow include:
- Scalable and easy to expand on basic functions to improve existing dashboards, copy for new websites or create entirely new
- Compatible with other cloud platforms like AWS, Azure and more
- Foundation for more advanced reporting with R packages
A Final Tip:
Don’t forget to use your delivery setup to automate some error alerts and notifications for yourself. Computers mess up, so we added error logging and alerts via the Slackr package. That way, when a data export fails or someone accidentally unplugs the servers, we get one or more of the following alerts in a shared Slack channel:
- “@empirical-path. something is wrong with dashboard X! Please take a look. Here is a direct link to the report.”
- “@empirical-path. the email for website Y failed to send! Please check this out ASAP!”
Notifications & Updates:
- “@empirical-path data refresh for dashboard X was successful so you can rest easy. Here is a direct link to the report.”
- “@empirical-path the email for website Y was sent successfully! Check your inbox.”
In Closing, A Shout Out to Mark
We hope you found this helpful and that it inspires you to explore how this might help you save time and get your Friday evenings back. Please feel free to contact me at www.empiricalpath.com or @empiricalpath with questions.
We would be remiss not to personally thank Mark Edmondson for his contribution to this project. He is the developer of the R package, googleanalyticsR, that made all this possible (which shouldn’t be that surprising once you find out how many other helpful R packages he’s created). Give him a follow on Twitter, GitHub or subscribe to his helpful blog here.