Content Audit 2017/2018: Data Collection


Like many creatives, we generate a lot of content … and much of the time we lose track of it because we’re off focusing on creating the next great thing.

This can turn into multiple problems:

  • time spent recreating resources that we already have
  • resources (PDFs, slide decks, webinars, WordPress posts and pages, landing pages, drip sequences) with
    • old branding
    • outdated calls to action
    • broken links
    • poor or off-message SEO
  • missed opportunities to provide free or paid resources that we’ve simply forgotten about

A further motivation in our case was the desire to morph existing content into LearnDash courses for our membership site launch.

In our case we needed to capture as much as possible of the text content of PDFs, doc/docx files, landing pages, and email messages spanning Ontraport, Google Drive, Amazon S3 storage, two WordPress sites, Instant Teleseminar, and two separate Macintoshes. Obviously this exercise could have gone deeper and handled more file types, but it was sufficient for our first pass.

This post is intended to show you some techniques for collecting the data on the off chance that it will help speed your own efforts.

All data sources were pulled into CSV files.

Ontraport

Goal: Capture text contents of all messages and landing pages

Setup step: download and install the Ontraport API client library (https://github.com/Ontraport/SDK-PHP) and html2text (https://github.com/soundasleep/html2text)
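One way to handle that setup is to drop the zip downloads of both repos next to the scripts; unzipping them produces the SDK-PHP-master and html2text-master folders that the require_once lines below expect (a minimal sketch, assuming master-branch zip downloads into the current directory):

bash$ unzip SDK-PHP-master.zip && unzip html2text-master.zip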

Grab all messages including the message body and strip the HTML, e.g.

bash$ php ./get-messages.php >messages.csv

get-messages.php:
<?php
    // Pull every Ontraport message, convert the HTML body to plain text, and emit CSV on stdout.
    require_once('./SDK-PHP-master/src/Ontraport.php');
    require_once('./html2text-master/html2text.php');
    use OntraportAPI\Ontraport;

    $client = new Ontraport("APPID","KEY");

    // No filter – grab everything. (Add a "search" param here if you only want a subset.)
    // Note: results come back in pages; if you have more messages than one page holds,
    // paginate with "start" the same way get-pages.php does below.
    $requestParams = array();
    $response = $client->message()->retrieveMultiple($requestParams);
    $resp = json_decode($response);

    $out = fopen('php://output', 'w');
    fputcsv($out, array('id','subject','message body'));
    foreach ($resp->{'data'} as $msg) {
        fputcsv($out, array($msg->id, $msg->subject, convert_html_to_text($msg->message_body)));
    }
    fclose($out);
?>

Grab all landing pages, pull each page itself, and strip the HTML, e.g.

bash$ php ./get-pages.php >landingpages.csv

get-pages.php:
<?php
    // Pull every Ontraport landing page, fetch its hosted URL, strip the HTML,
    // and emit CSV on stdout.
    // TBD - add actual rate limiting instead of cheesy sleep
    require_once('./SDK-PHP-master/src/Ontraport.php');
    require_once('./html2text-master/html2text.php');
    use OntraportAPI\Ontraport;

    $client = new Ontraport("APPID","KEY");
    $out = fopen('php://output', 'w');

    // How many landing pages are there in total?
    $requestParams = array();
    $response = $client->landingpage()->retrieveCollectionInfo($requestParams);
    $resp = json_decode($response);
    $count = $resp->{'data'}->{'count'};

    $counter = 0;
    fputcsv($out, array('id','name','public URL','working URL','page text'));

    // Walk the collection 50 records at a time.
    while ($counter < $count) {
        $requestParams = array(
            "sort"    => "name",
            "sortDir" => "asc",
            "start"   => $counter
        );
        $response = $client->landingpage()->retrieveMultiple($requestParams);
        $resp = json_decode($response);

        foreach ($resp->{'data'} as $msg) {
            // Look up the hosted (working) URL for this landing page ...
            $rp2 = array(
                "id" => $msg->id
            );
            $response2 = $client->landingpage()->getHostedURL($rp2);
            $resp2 = json_decode($response2);

            // ... then fetch the page itself so we can capture its text.
            $ch = curl_init($resp2->{'data'});
            curl_setopt($ch, CURLOPT_HEADER, 0);
            curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
            $output = curl_exec($ch);
            curl_close($ch);

            fputcsv($out, array($msg->id, $msg->name, $msg->domain, $resp2->{'data'}, convert_html_to_text($output)));
        }

        $counter += 50;
        // Avoid rate limiting of 180/minute. This sleep is overkill since we're doing 51/iteration.
        sleep(30);
    }
    fclose($out);
?>

WordPress

Goal: Capture text contents of all posts and pages on www.pathtoprofitacademy.com and www.minetteriordan.com

This one was pretty straightforward: just install the “WP All Export” plugin and do exports of pages and posts from each site. Make sure you add the post/page content to the export, and also make sure you use the same columns and column order for each export. I didn’t bother to post-process the output to strip out the HTML from the content column, but that would be a nice addition. Another option would be wp-cli, which allows more flexibility with scripting; see the sketch below.
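For example, something like this (an untested sketch, assuming you can run WP-CLI on each site’s host and that the built-in post_content field is all you need) dumps published posts and pages straight to CSV:

bash$ wp post list --post_type=post --post_status=publish --fields=ID,post_title,post_content --format=csv >posts.csv
bash$ wp post list --post_type=page --post_status=publish --fields=ID,post_title,post_content --format=csv >pages.csv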

Amazon S3

Goal: Capture file names only of all files in our S3 buckets

AWS CLI to the rescue! The AWS command line interface works great for scripting the collection of information from your buckets. You are on your own for setting up users, permissions, etc. I did some basic scripting to go through all the buckets and capture the results in CSV format.
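If you haven’t configured the CLI yet, the one-time setup is just the following (assuming you’ve already created an IAM user with read access to the buckets; it prompts for the access key, secret key, and default region):

bash$ aws configure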

bash$ ./s3audit.sh >s3files.csv

s3audit.sh:

#!/bin/bash                                                                                                                        
echo "\"Date\",\"Size\",\"Filename\""
buckets=`~/Library/Python/2.7/bin/aws s3 ls | awk '{ print $3 }'`
for i in $buckets; do
    ~/Library/Python/2.7/bin/aws s3 ls $i --recursive | awk -v X=https://s3-us-west-2.amazonaws.com/$i '
    {
      # date, size, then the full public URL (re-joining filenames that contain spaces)
      printf("%s,%s,", $1, $3);
      printf("\"%s/", X);
      for (i=4; i<=NF; i++) {
        printf("%s", $i);
        if (i<NF) printf(" ");
      }
      printf("\"\n");
    }'
done

Google Drive

Goal: Capture file names only of all files in our Google Drive

Similar to AWS, there is a command line interface to Google Drive – gdrive. I just used it in a very basic (cheesy?) fashion to pull filenames for our review. You could do further work to pull the files down and get their actual contents; see the sketch below.

bash$ ./gdrive-osx-x64 list --absolute --name-width 0 --max 1000 --order name --query "trashed = false" | colrm 1 84 | colrm 115 1000 | sort -ub >gdrive.txt
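For example, once you have a file’s ID from gdrive list, something like this (a rough sketch; check gdrive help download for the options in your build) pulls the file down so you could run it through the same text extraction as the local files below:

bash$ ./gdrive-osx-x64 download <fileId>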

MacOS

Goal: Capture file names and text contents of all PDF, doc, and docx files on two separate Macs.

I thought I was going to use NeoFinder for this because it’s pretty awesome, but it wouldn’t export file contents. We will still use NeoFinder for its search abilities as a Digital Asset Manager, but for now I just wrote some scripts.

Setup step: download and install pdftotext (part of the Xpdf/Poppler command line tools).

bash$ ./localfiles.sh ~/Documents ~/some_other_dir

(The scripts below append their output directly to localfiles.csv rather than writing to stdout, so no redirect is needed here.)

localfiles.sh:

#!/bin/bash
# Write the CSV header, then hand each matching file to a per-type helper script.
# The parentheses are needed so -exec applies to both doc and docx matches.
echo \"Filename\",\"Text\" >localfiles.csv
find "$@" -name "*.pdf" -exec ./localpdffiles.sh {} \;
find "$@" \( -name "*.doc" -o -name "*.docx" \) -exec ./localdocfiles.sh {} \;

localpdffiles.sh:

#!/bin/bash
# Extract the PDF's text, strip embedded double quotes, and append one
# "filename","text" row to localfiles.csv. Clearing temp.txt first keeps a
# failed conversion from re-using the previous file's text.
rm -f temp.txt
pdftotext "$*" temp.txt 2>/dev/null
echo -n \"$*\",\" >>localfiles.csv
cat temp.txt | sed 's/"//g' 2>/dev/null >>localfiles.csv
echo \" >>localfiles.csv

localdocfiles.sh:

#!/bin/bash
# Same idea for doc/docx files: convert to plain text with textutil, strip
# double quotes, and append one CSV row.
rm -f temp.txt
textutil -convert txt -output temp.txt "$*" 2>/dev/null
echo -n \"$*\",\" >>localfiles.csv
cat temp.txt | sed 's/"//g' 2>/dev/null >>localfiles.csv
echo \" >>localfiles.csv

Instant Teleseminar

Goal: Just get the names and links of existing recordings out of their tool into a spreadsheet.

Fortunately their main list of webinars and recordings is an HTML table: I was able to copy and paste it directly from the web page into a Google Sheet.

Data representation and searching

Goal: Come up with something simple to allow searching of all of the data

This turned out to be easier than I thought. I was planning to import everything into MySQL for better searching, but the boss wanted to be able to scan things easily. Numbers and MS Excel on my Mac weren’t super-happy importing the larger CSV files, but Google Sheets worked like a charm. I simply imported each CSV into a separate sheet (tab at the bottom), labeled each sheet, and put some simple instructions on the first sheet.


Depending on how the data analysis step goes I’ll either update this post or create a new one. Like I said, hopefully this gives you a little bit of a starting point for any data collection you need to do on your own resources.


About Brad Dobson

Brad Dobson is a co-founder of the Path to Profit Academy, and husband of Minette Riordan. He handles all the techy stuff and shares parenting duties. He is a 2-time marathon and 2-time Ironman finisher and for some reason enjoys endurance athletics. After 25 years in the software industry he quit his job to become an entrepreneur alongside Minette.
