scraping HTML/XML in openFrameworks

One of my students wants to get stock quotes from Yahoo Finance for his project, so I wrote a bit of code that does an HTTP GET and runs a regular expression on the result. I am a bit out of touch with what has been going on with scraping/web-related stuff in oF, but I recently got a request for a scraper addon that I helped write forever ago, and I wasn’t able to quickly find anything much newer, so I thought this might be useful. It takes advantage of the Poco library, which is now part of the standard oF distribution, so anyone can use it without installing any additional addons.

All I am doing here is using the HTTP classes in Poco to download the source of a page (in this case, a Yahoo page for Google's stock), and then running a regular expression on the result to pull out the particular bit of markup that I am interested in (in this case, the stock price listed on that page). If you are dealing with well-formed XML (like RSS), you could also use the XML library that is part of Poco for more structured parsing, but that’s not what I need here.

Most of this was stolen from the Poco distribution: poco-x.x.x/Net/samples/httpget/src/httpget.cpp

First you have to include the right headers and add the using declarations. This should be done in testApp.h

#include "Poco/Net/HTTPClientSession.h"
#include "Poco/Net/HTTPRequest.h"
#include "Poco/Net/HTTPResponse.h"
#include "Poco/StreamCopier.h"
#include "Poco/Path.h"
#include "Poco/URI.h"
#include "Poco/Exception.h"
#include "Poco/RegularExpression.h"

using Poco::Net::HTTPClientSession;
using Poco::Net::HTTPRequest;
using Poco::Net::HTTPResponse;
using Poco::Net::HTTPMessage;
using Poco::StreamCopier;
using Poco::Path;
using Poco::URI;
using Poco::Exception;
using Poco::RegularExpression;

Then you can run this code wherever you need to in your testApp.cpp

try
{
	// Split the URL into host, port, and path + query string.
	URI uri("http://search.yahoo.com/search?p=GOOG");
	std::string path(uri.getPathAndQuery());
	if (path.empty()) path = "/";

	// Send the GET request and print the status of the response.
	HTTPClientSession session(uri.getHost(), uri.getPort());
	HTTPRequest req(HTTPRequest::HTTP_GET, path, HTTPMessage::HTTP_1_1);
	session.sendRequest(req);
	HTTPResponse res;
	std::istream& rs = session.receiveResponse(res);
	std::cout << res.getStatus() << " " << res.getReason() << std::endl;

	// Copy the response body (the page source) into a string.
	std::string result;
	StreamCopier::copyToString(rs, result);

	// Capture the number between the <li> tags; the () mark the subpattern we want.
	RegularExpression re("<li>([0-9\\.]+)</li>");
	RegularExpression::MatchVec matches;
	re.match(result, 0, matches);

	// result.substr(matches[0].offset, matches[0].length) -- contains the entire matched <li>
	// result.substr(matches[1].offset, matches[1].length) -- contains the subpattern inside the ()
	if (matches.size() > 1)
	{
		std::cout << result.substr(matches[1].offset, matches[1].length) << std::endl;
	}
}
catch (Exception& exc)
{
	std::cerr << exc.displayText() << std::endl;
	std::exit(1);
}

The regular expression matcher fills a vector of Match objects, each of which just contains the offset and length of one match within the string. If you aren’t familiar with regular expressions, this probably doesn’t make much sense. But, as it says in the code, you now have 2 matches that you can use as you please:

    result.substr(matches[0].offset, matches[0].length)  contains the entire matched <li>
    result.substr(matches[1].offset, matches[1].length)  contains the subpattern inside the ()
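If the offset/length idea is still fuzzy, here is a minimal sketch you can compile without oF or Poco. It swaps in std::regex from the standard library in place of Poco's RegularExpression, so it only illustrates the idea and is not the code above. The Span struct and the findSpans function are names I made up for the illustration.

```cpp
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

// Hypothetical stand-in for Poco's Match: where a match starts and how long it is.
struct Span {
    std::size_t offset;
    std::size_t length;
};

// Searches text for the first occurrence of pattern.
// Index 0 of the result is the whole match, index 1 the first (...) group, and so on.
// An empty vector means the pattern was not found.
std::vector<Span> findSpans(const std::string& text, const std::string& pattern) {
    std::vector<Span> spans;
    const std::regex re(pattern);
    std::smatch m;
    if (std::regex_search(text, m, re)) {
        for (std::size_t i = 0; i < m.size(); ++i) {
            spans.push_back({static_cast<std::size_t>(m.position(i)),
                             static_cast<std::size_t>(m.length(i))});
        }
    }
    return spans;
}
```

For example, calling findSpans on a made-up fragment such as `<ul><li>GOOG</li><li>531.25</li></ul>` (not real Yahoo markup) with the pattern `<li>([0-9.]+)</li>` should give two spans. Passing the first to substr yields `<li>531.25</li>`, and the second yields just `531.25`. Checking that the vector has more than one entry before touching index 1 is what keeps a failed match from crashing your app.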