Simple Options Parser for PhantomJS

18. November 2011 14:33 by Cameron in javascript, PhantomJS, Programming

Recently I needed a way to parse command line options with PhantomJS. I didn't see anything else on the web that allowed for arbitrary ordering of command line arguments to PhantomJS scripts, so I made my own. Here's the code for those interested:

// argument results
var a1, a2, a3, a4;

function optionParser() {
	var opt = 0;
	// walk the arguments while they look like switches (start with '-')
	while ((opt < phantom.args.length) && (phantom.args[opt][0] === '-')) {
		var sw = phantom.args[opt];
		switch (sw) {
			case '-a1':
				// consume the next argument as this switch's value
				opt++;
				a1 = phantom.args[opt];
				break;
			case '-a2':
				opt++;
				a2 = phantom.args[opt];
				break;
			case '-a3':
				opt++;
				a3 = phantom.args[opt];
				break;
			case '-a4':
				opt++;
				a4 = phantom.args[opt];
				break;
			default:
				console.log('Unknown switch: ' + phantom.args[opt]);
				phantom.exit();
				break;
		}
		opt++;
	}
}

This can easily be modified to collect the results in an array or object instead of reading each argument into its own variable as shown here. You can also accept numeric arguments and, in your application logic, use parseInt() together with isNaN() to check that the input is a valid integer.
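For example, a script using this parser could be invoked with its switches in any order and validate a numeric option along these lines (the script name and the meaning of -a2 here are just illustrative):

// hypothetical invocation: phantomjs parse.js -a3 output.png -a1 http://example.com -a2 15
optionParser();
console.log('a1 = ' + a1 + ', a2 = ' + a2 + ', a3 = ' + a3);

// treat -a2 as a numeric option and validate it
var timeout = parseInt(a2, 10);
if (isNaN(timeout)) {
	console.log('-a2 must be a valid integer');
	phantom.exit();
}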

Passing information to and from webpages in PhantomJS

27. September 2011 11:01 by Cameron in javascript, PhantomJS

Recently, I needed a way to pass dynamic content to and from webpages using PhantomJS as part of writing my screen scraper. I needed the scraper to follow dynamic sets of links and scrape the data from each page. Since a webpage's scope is currently sandboxed, I had to find a way to pass data to and from webpages. With the addition of the new filesystem module in PhantomJS 1.3, it is now possible to pass data from the main scope to an individual page's scope. Any data that you want passed to a particular page should be saved as a JavaScript string to a JavaScript file. Then you can inject that JavaScript into the page in page.onLoadFinished so that the data is accessible within the page's scope. For example:

var page = require('webpage').create(),
     fs = require('fs'),
     data = "var dataObject = { item: 'value' };",
     fullpath;

fullpath = fs.workingDirectory + fs.separator + 'data.js';
// open the file for writing and save the data as a JavaScript string
var dataFile = fs.open(fullpath, 'w');
dataFile.write(data);
dataFile.close();

// inject the JavaScript data once the page has loaded,
// then read it back from the page's scope
page.onLoadFinished = function() {
	page.injectJs(fullpath);
	// put the page data in a local variable
	var output = page.evaluate(function () {
		// print the value of the data object from inside the page
		console.log(dataObject.item);
		return dataObject.item;
	});
	// output should be the same value as the page's dataObject.item
	console.log(output);
	phantom.exit();
};

// check that the file was written successfully before opening the page
if (fs.size(fullpath) > 0) {
	console.log('File written successfully!');
	page.open('http://somesite.org/page.html');
}
else {
	console.log('Error in writing the file!');
	phantom.exit();
}

For more information about PhantomJS' File System module, please visit: http://code.google.com/p/phantomjs/wiki/Interface#Filesystem_Module

While this may not be the best long-term approach, it does provide a way to get data to and from your pages until official support for passing data to a webpage object becomes available in PhantomJS.
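Getting data back out of a page works the same way in reverse: whatever page.evaluate() returns can be serialized and written to disk with the same filesystem module. Here is a minimal sketch of that, assuming a results.json output file; it would sit inside the onLoadFinished handler above, before the call to phantom.exit():

// inside page.onLoadFinished, after injecting data.js
var result = page.evaluate(function () {
	// example: return the page title along with the injected value
	return { title: document.title, item: dataObject.item };
});

// write the scraped result back to disk as JSON
var outPath = fs.workingDirectory + fs.separator + 'results.json';
var outFile = fs.open(outPath, 'w');
outFile.write(JSON.stringify(result));
outFile.close();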

Take Screenshot of all HTML documents in a folder using PhantomJS

26. September 2011 01:14 by Cameron in javascript, PhantomJS, Programming

Recently I came across a question on Stack Overflow asking how to take screenshots of all HTML files in a local folder. I have been playing with PhantomJS quite a bit lately and felt comfortable answering the question. Here is the code for those interested:

var page = require('webpage').create(), loadInProgress = false, fs = require('fs');
var htmlFiles = [];
console.log('working directory: ' + fs.workingDirectory);
var curdir = fs.list(fs.workingDirectory);

// loop through files and folders
for(var i = 0; i< curdir.length; i++)
{
	var fullpath = fs.workingDirectory + fs.separator + curdir[i];
	// check if item is a file
	if(fs.isFile(fullpath))
	{
		if(fullpath.indexOf('.html') != -1)
		{
			// show full path of file
			console.log('File path: ' + fullpath);
			htmlFiles.push(fullpath);
		}
	}
}

console.log('Number of Html Files: ' + htmlFiles.length);

// output pages as PNG
var pageindex = 0;

var interval = setInterval(function() {
	if (!loadInProgress && pageindex < htmlFiles.length) {
		console.log("image " + (pageindex + 1));
		page.open(htmlFiles[pageindex]);
	}
	if (pageindex == htmlFiles.length) {
		console.log("image render complete!");
		phantom.exit();
	}
}, 250);

page.onLoadStarted = function() {
	loadInProgress = true;
	console.log('page ' + (pageindex + 1) + ' load started');
};

page.onLoadFinished = function() {
	loadInProgress = false;
	page.render("images/output" + (pageindex + 1) + ".png");
	console.log('page ' + (pageindex + 1) + ' load finished');
	pageindex++;
};

The process is quite simple. First, I loop through all items in the current working directory and check whether each one is a file with the .html extension. I add each HTML file's path to an array that I later loop through to take the screenshots. A screenshot must be taken after a page has fully loaded so that it contains actual content rather than a blank image; this is done by saving the image in the page.onLoadFinished callback. The loop that drives the screenshots polls every 250 ms and only opens the next page once the previous one has finished loading, so each page is fully rendered in the headless browser before advancing to the next one.
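If you want the screenshots to come out at a consistent size, one option is to set the page's viewport (and optionally a clip rectangle) before the interval starts opening pages. A small sketch, with the 1024x768 dimensions chosen arbitrarily:

// give every page the same viewport so the rendered images are uniform
page.viewportSize = { width: 1024, height: 768 };
// optionally clip the render to the viewport instead of the full page height
page.clipRect = { top: 0, left: 0, width: 1024, height: 768 };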

XBox Live Data

20. September 2011 14:32 by Cameron in Programming, Xbox Live

While my gaming social networking site, IGA: International Gamers' Alliance, is still in beta, I have been looking at ways to provide a richer experience for my users. Lately I've been working on a way to gather data from XBox Live so that I can provide content to my users on IGA. I used to gather data through a RESTful API that Microsoft employee Duncan Mckenzie hosted on his website, which exposed official XBox Live data. However, his service is no longer available. While there is an official XBox Live API, access to it is restricted to those in the XBox Community Developer Program. Acceptance into the XBCDP is very limited at the moment, and it seems that only well-known companies with sponsors receive membership into the program.

While it would be very nice to get official access to the XBox Live API, it may be a while until I can get into the program. My social networking site, IGA, is still in beta and has much left on the roadmap to completion. Currently I am the only developer on the project and I am also in school, so development is slow. Maybe once IGA is closer to completion, Microsoft will be more eager to accept me into the program. In the meantime, I have a solution for gathering data from XBox Live.

There are a couple of places to get data from XBox Live: the publicly available user's gamercard and the user's protected XBox.com profile. Getting data from the public gamercard is very easy; one could write a parser in PHP, C#, or even jQuery to pull the different values from the HTML elements on the page. Retrieving data from a user's XBox.com profile requires a little more skill and resources. You cannot simply use cURL to remotely log in to XBox.com, since it has anti-bot mechanisms that check the browser agent, browser cookies, and many other aspects that can't easily be manipulated with cURL. There is a remedy to this problem, however.

This past summer, I learned about a headless WebKit browser called PhantomJS from some co-workers on a project at work. We needed something that could run without a GUI on a server and manipulate the DOM of a webpage, and PhantomJS gave us exactly what we needed. After that project, it occurred to me that I could use PhantomJS together with jQuery to manipulate the DOM and screen-scrape data from XBox.com.
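As a rough illustration of the gamercard half of this, a PhantomJS script could open the public gamercard page and read values straight out of the DOM. This is only a sketch; the gamercard URL format and the element selectors below are assumptions that would need to be checked against the real markup:

var page = require('webpage').create();
// hypothetical public gamercard URL; verify the real address and format
var url = 'http://gamercard.xbox.com/en-US/SomeGamertag.card';

page.open(url, function (status) {
	if (status !== 'success') {
		console.log('Failed to load gamercard');
		phantom.exit();
		return;
	}
	var info = page.evaluate(function () {
		// selectors are placeholders; inspect the page to find the real ones
		var tag = document.querySelector('.gamertag');
		var score = document.querySelector('.gamerscore');
		return {
			gamertag: tag ? tag.textContent : null,
			gamerscore: score ? score.textContent : null
		};
	});
	console.log(JSON.stringify(info));
	phantom.exit();
});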

I'm currently working on scripts to pull data from users' profiles, including each user's games, the achievements earned in each game, and other information not publicly available on gamercards. Please understand, though, that screen scraping should only be done as a last resort, and making numerous requests per day is taxing on both ends. I will implement some sort of data caching that pulls new data on a schedule to limit bandwidth usage. I plan to release this code to my Git hosting when it is finished.
