Monday, December 29, 2014

Do Frameworks get in the way? A tale of Python and PayPal IPN.

I was writing some basic Python 3 CGI code to handle PayPal IPN posts.  PayPal docs here.  The IPN message authentication protocol has PayPal first POST a message to your URL.  After sending back a quickie 200 OK in response to that POST, you then POST the "complete, unaltered message" back to PayPal, with "the same fields (in the same order) as the original message".  I'm not sure the ordering really matters, as I have seen online code that ignores it and still claims to work.  But, just to be robust, I wanted to follow the "correct" protocol.

If PayPal were sending a GET, one could use os.environ.get('QUERY_STRING').  But, for a POST, that returns None.  The Python cgi library provides a nice, standard, "handles lots of tricky cases" mechanism to read the POST fields, using cgi.FieldStorage().  However, that returns a dictionary-like object that does not preserve the field order.  I reported this in a Stack Overflow question, and asked how one could get the data exactly as sent.  I mean, it's all one big string coming over the wire, e.g. "foo=bar&count=3", right?  This should be simple.  HTTP can be complex, but this part isn't very tricky.

To my surprise, nobody answered, and not many people even viewed the question.  Maybe it was poorly worded.  I think the real reason might be that programmers are so used to using a library or framework, such as cgi.FieldStorage or Django, that they don't understand what's actually going on deep underneath.  I'm not picking on Python programmers here; I think the same is true in most languages.

After some playing around, the answer is astoundingly simple.  The POST data is coming over the wire as a String, so just read it from stdin.

query_string = sys.stdin.read()

To POST everything back to PayPal, use this simple code.  (Should I worry more about encoding?)

formData = "cmd=_notify-validate&" + query_string
req = urllib.request.Request(PAYPAL_URL, formData.encode())
req.add_header("Content-type", "application/x-www-form-urlencoded")
response = urllib.request.urlopen(req)
status = response.read().decode()
if status != "VERIFIED":
    pass   # complain/abort/whatever
else:
    pass   # continue processing

There's one drawback: now you can't get cgi.FieldStorage() to work.  When it goes to read from stdin, there's nothing left, so it returns an empty dictionary.  (The Python cgi source code is here.)  So, if you also want the convenience of a dict for other purposes, such as checking on various IDs or the price they paid, you need to create your own dict.  But that is also trivial:

multiform = urllib.parse.parse_qs(query_string)

Just like cgi.FieldStorage(), this returns a dictionary where the values are lists of strings, since it is possible for a key to be repeated in a query, e.g. foo=bar&foo=car.  However, in practice, this is rare, and doesn't apply in the PayPal case.  I guess you could always ask for the 0th item in the list - FieldStorage has some special methods for this.  To simplify things, I created a nice, simple, single-valued form with strings for keys and values:

form = {}
for key in multiform.keys():
    form[key] = multiform.get(key)[0]

Wednesday, November 12, 2014

Into the Clouds, deploying node.js with Modulus and OpenShift

My Agility website, www.nextq.info, is up and running on modulus.io.  I like Modulus.  It's easy to use, has been reliable, and you don't need to do a ton of heavy-duty Unix-ese command line stuff.  Their web interface does most of the work, and a simple command modulus deploy will update your codebase.  The main drawback is that they charge a small fee, $15 a month.  I haven't tried any scaling yet.

So lately I've also been playing with OpenShift.  It's free for small projects, and that even includes a little scaling.  It's definitely harder, more technical, and more "Unixy" than Modulus.  You deploy using git, and many commands must be done from the command line, not the web UI.  They have a free book to get you started, Getting Started with OpenShift.  After some fiddling, I got things going.

One major issue is that Modulus and OpenShift use different environment variables for important settings like the port and ip address.  So, if you want code portable across both, you will need something like this in your node code:


function setupConfig(config) {
   if (process.env.OPENSHIFT_APP_DNS) {
      config.port = process.env.OPENSHIFT_NODEJS_PORT;
      config.ipAddress = process.env.OPENSHIFT_NODEJS_IP || '127.0.0.1';
      config.mongoURI = process.env.OPENSHIFT_MONGODB_DB_URL;
      config.isOpenshift = process.env.OPENSHIFT_APP_DNS;
   }
   else if (process.env.MODULUS_IAAS) {  // modulus
      config.port = process.env.PORT;
      config.ipAddress = '0.0.0.0';  // modulus doesn't need an ip
      config.mongoURI = process.env.MONGO_URI;
      config.isModulus = process.env.MODULUS_IAAS;
   }

   // possibly more here...
   
   return config;
}

And use these values when you create the server, i.e.

app.listen(config.port, config.ipAddress, function(){
  ...
});


I have the "isXXX" fields so that you can set up platform-specific options like shutdown hooks.
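For example, here's a minimal sketch of what I mean (it assumes server is the value returned by app.listen above, and that OpenShift stops a gear by sending SIGTERM):

// Sketch: register a graceful-shutdown hook only when running on OpenShift.
// Assumes "server" is the http.Server returned by app.listen(...) above.
if (config.isOpenshift) {
   process.on('SIGTERM', function () {
      console.log('SIGTERM received, closing server');
      server.close(function () {
         process.exit(0);
      });
   });
}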

For OpenShift you must change the package.json file to point to your main script.  OpenShift defaults to server.js, whereas most people use app.js.  Be sure to have the following lines in package.json with the correct name of your main file.

"scripts": {
    "start": "node app.js"
  },
"main": "app.js",

Finally, on a scaled platform, OpenShift (using the haproxy load balancer) "pings" your app every two seconds, quickly filling up the log file with confusing junk.  There are even three (duplicate) bugs for this: 918783, 923141, and 876473.  Their suggested "fix" is to run a cron job calling rhc app-tidy once in a while to clear out your logs.  That addresses the "too much space" issue, but the log file is still a pain to use, because all those pings make it harder to spot real problems.  If you are brave, you could edit the haproxy.cfg file as hinted at (but not fully explained) in this StackOverflow post.  I chose an alternative.

My fix is to use Express to insert some middleware before the logger.  The "pings" can be recognized since they have no x-forwarded-for header.  Real requests should have that field, and that's also the value you want in the logfile.  At least, that works for me.

First, a middleware function that swallows these pings instead of calling next().  Ever the fiddler, I wrapped it in another function so that it can still let a subset of the pings through - you might want to see a ping every hour or so.


function ignoreHeartbeat(except) {
   except = except || 0;             // 0 means swallow every ping
   var count = 1;
   return function(req, res, next) {
      if (req.headers["x-forwarded-for"])
         return next();              // a real request - normal processing

      // a haproxy ping: let roughly one in every "except" through, swallow the rest
      if (except > 0) {
         if (--count <= 0) {
            count = except;
            return next();
         }
      }

      res.end();                     // answer the ping, but skip the logger
   };
}
Then, in your app setup code, add this before you add the logger, e.g. (Express 3 shown):


app.use(ignoreHeartbeat(1800));         // 1800 is once an hour
...
app.use(express.logger(myFormat));
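I haven't shown myFormat; it's whatever logger format you prefer.  Something along these lines, built from the standard Express 3 / connect logger tokens, would produce log lines like the ones below (note that :req[x-forwarded-for] prints a "-" for the pings, which lack that header):

// Roughly matches the log lines below, using standard connect/Express 3 logger tokens.
var myFormat = ':date :req[x-forwarded-for] - :method :url :status - :response-time ms';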

Here's example log data, where ignoreHeartbeat was set to 10, so logged pings appear roughly every 20 seconds.  Note how the pings have no IP address.

Wed, 12 Nov 2014 22:08:36 GMT - - GET / 200 - 2 ms
Wed, 12 Nov 2014 22:08:56 GMT - - GET / 200 - 2 ms
Wed, 12 Nov 2014 22:08:59 GMT 50.174.189.32 - GET / 200 - 10 ms
Wed, 12 Nov 2014 22:08:59 GMT 50.174.189.32 - GET /javascripts/jquery-jvectormap-1.2.2.css 200 - 17 ms
  (more "real" GETs here...)
Wed, 12 Nov 2014 22:09:17 GMT - - GET / 200 - 3 ms
Wed, 12 Nov 2014 22:09:37 GMT - - GET / 200 - 1 ms

Monday, October 20, 2014

Web Scraping with node.js and Cheerio

I recently gave a talk at the BayNode Meetup, about my experiences web scraping for dog agility trials using node.js and the cheerio module.  The results are used for my website, www.nextq.info.

You can find the slides as Google Docs here:  Web scraping with cheerio.  Enjoy!


Wednesday, July 23, 2014

Groovy-Like XML for Java. Simple and Sane.

Parsing and navigating through XML in Java is a pain.  The org.w3c.dom.* classes are numerous, messy, and "old style", with no Collections, no Generics, no varargs.  XPath helps a lot with the navigation part, but is still a bit complex and messy.

Groovy, with XmlParser and XmlSlurper and their associated classes, makes this amazingly, dramatically easier.  Simple and Sane.  For example, Making Java Groovy Chapter 2 has an example that parses the Google geocoder XML data to retrieve latitude and longitude.  Below are the essentials of the code.  The full code, which is not much longer, is on GitHub here.

String url = 'http://maps.google.com/maps/api/geocode/xml?' + somemore...
def response = new XmlSlurper().parse(url)
stadium.latitude = response.result[0].geometry.location.lat.toDouble()
stadium.longitude = response.result[0].geometry.location.lng.toDouble()
The parsing is trivial, and navigating to the data (location.lat or location.lng) is also simple, following the familiar dot notation.

Can you do anything like this in pure Java?  Not quite.  So I wrote a small library, xen, to mimic much of how Groovy does things.  The full Geocoder.java code is here, snippet below:


String url = BASE + URLEncoder.encode(address);
Xen response = new XenParser().parse(url);

// Option 1: XPath slash style, 1-based indices
latLng[0] = response.toDouble("result[1]/geometry/location/lat");
latLng[1] = response.one("result[1]/geometry/location/lng").toDouble();

// Option 2: Groovy dot style, 0-based indices
latLng[0] = response.toDouble(".result[0].geometry.location.lat");
latLng[1] = response.one(".result[0].geometry.location.lng").toDouble();
Pretty close, eh?

The main difference is that we can't use the dot notation directly from an object, but we can use a very similar slash notation based upon XPath syntax. If you use XPath notation, one major difference from Groovy is that array indices in W3C XPath are 1-based, not 0-based.  Therefore note that we access the 1st element of result, not the 0th.  However, if the "path" starts with a . and a letter, as in the final example, the path is treated as a Groovy / "dot notation" style, with 0-based indices.

So, if you want to greatly simplify parsing and navigating through XML, and/or you love how Groovy does things, please check out my (very beta!) xen library, which allows you to do it in Java.  Currently it is compiled against Java 6, but I think it should be fine in Java 5.  So if you need to support some Android device, or can't or don't want to integrate Groovy into your Java projects, this could be very useful.

Xen library
JavaDocs
README

The README discusses various design decisions, particularly, how my design converged upon many aspects of the Groovy design.   More discussion will appear in later posts.  And, be warned, this is still a very early version, 0.0.2, so there are probably bugs, some mistakes, and upcoming API changes.


Node for Java Programmers

At a recent BayNode Meetup, I gave a 15 minute presentation on "Node for Java Programmers".  Mainly notes on common things I did wrong coming from the Java world, and ideas or idioms to deal with them.

I got some good feedback and positive responses, and recently edited the presentation.

Here is a link to it.   (on Google Docs).

Thursday, June 12, 2014

Coding by Convention is Great

... except when it isn't.

"Coding by Convention" (a.k.a. "Convention over Configuration") attempts to simplify programming by telling the programmer the preferred way to name or organize things.  It often saves a lot of time and hassle.  Without it, you write extra configuration files, typically in XML.  Spring and J2EE used to require way too much configuration, with lots of stupid redundancies, something like "When I say Foo bean I mean you to use a Foo.class, when they go to myCompany.com/order/books use the com.myCompany.order.books servlet".  On the other hand there can be too much convention - I've never used Maven, but hear that it is particularly dictatorial and hard to modify.

I'm developing a lot in JavaScript / node.js lately, and wanted the ability to save my data as either csv files or iCal (.ics) files.  Searching the NPM registry finds several candidates.

In csv files, the header line contains the name for each column.  Using convention, the column name would be taken from the property key, and the property value would map directly to the data in the following lines.  Sometimes you want to change this.  Can you, or is it all done by convention?  As I understand their documentation:

to-csv      convention only
fast-csv    allows for transformation, but over an entire row
json-2-csv  convention only
json-csv    allows for flexible transformations

I ended up using json-csv, though one drawback of its power is that it takes more work to use.
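To make "convention only" concrete, here's a minimal hand-rolled sketch of the idea (no quoting or escaping, so don't use it for real data): the header comes straight from the property keys, and each following line comes straight from the property values.

// Sketch of "convention only" CSV: header row from the keys of the first object,
// data rows straight from the property values. No quoting/escaping.
function toCsvByConvention(rows) {
   var keys = Object.keys(rows[0]);
   var lines = [keys.join(',')];
   rows.forEach(function (row) {
      lines.push(keys.map(function (k) { return row[k]; }).join(','));
   });
   return lines.join('\n');
}

// toCsvByConvention([{name: 'Fido', score: 98}, {name: 'Rex', score: 87}])
//   => "name,score\nFido,98\nRex,87"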

On the iCalendar side, the question is how to set up or create the complex VEVENT information.  One could use properties named DTSTART, UID, etc.  But it's extremely unlikely that your object has properties with those unusual and capitalized names, and the correct values.  Plus, DTSTART has a complex format with a possible "DATE-TIME" option and a time format of YYYYMMDDTHHMMSSZ.
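The timestamp part, at least, is easy enough to produce by hand; it's the surrounding VEVENT structure that's the real work.  A quick sketch:

// Turn a JavaScript Date into the iCalendar UTC timestamp form, e.g. 20141112T220836Z
function toICalDate(d) {
   return d.toISOString().replace(/[-:]/g, '').replace(/\.\d{3}/, '');
}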

cozy-iCal       builds VEvent objects programmatically
ical-generator  convention (uses .start .end for DTSTART DTEND)
icalevent       convention (also uses .start, .end etc)
icsjs           builds programmatically
icalendar       builds VEvent objects programmatically

So, it's a mishmash.  And convention, while simple and convenient, doesn't always do the trick.  For example, in my CSV data I'd like to include the distance from the user's location.  This is obviously not even a field in the data, since it is calculated on the fly per user from the respective latitudes and longitudes.  The CSV should not include the latitude and longitude - they're not very useful to the end user.  My start and end dates are also not simple fields; they are stored in an array.  So I definitely can't use convention.

Unless...

The obvious work-around is to create a new temporary object that meets the required convention from the fields and data in the original, "real" object.  In many cases, you would just write custom code to do this, especially if speed is a concern.  There are also some modules that vaguely do this.  (Did I miss some???)  But they are pretty limited.  For example, object-adapter can only copy values from a source object (renaming the fields), not apply any functions.

So, I wrote my own general-purpose module, remodeler.  For convenience there are copyKeys() and excludeKeys() methods for properties you simply want to copy as-is or ignore.  For the fancy stuff, you provide key / value pairs.  The key is the new property name, and the value is a "transformation".

If the transformation is a String, it means to copy the value from the old object, using the string as property name. For example, "UID", "uid" would mean to create a UID property by copying the previous uid property.

If the transformation is a function, it will be called via function(oldObject, key) and the result used as the new value.  In practice, the key argument is often ignored.  For example,

"SUMMARY", function(o,k) { return 'name:' + o.name + ' date:' + o.date[0]; }

would mean to create a new SUMMARY property by concatenating two existing values.
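Stripped of the convenience methods, the core idea is tiny.  Here's a rough concept sketch of what a "remodel" step does (this is the concept, not remodeler's actual API):

// Concept sketch (not remodeler's real API): build a new object from a spec whose
// values are either a string (copy that property) or a function (compute the value).
function remodel(oldObject, spec) {
   var newObject = {};
   Object.keys(spec).forEach(function (key) {
      var t = spec[key];
      newObject[key] = (typeof t === 'function') ? t(oldObject, key) : oldObject[t];
   });
   return newObject;
}

// e.g.
// remodel(evt, {
//    UID: 'uid',
//    SUMMARY: function (o, k) { return 'name:' + o.name + ' date:' + o.date[0]; }
// });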

In many cases, you are still better off writing your own custom code.  On further thought, I'm not sure how useful my module will be, since in JavaScript, it is just so easy to go

var newthing = {
   newkey: oldObject.oldKey,
   ...
}

Other times you can follow the conventions, or use the programmatic interface.  However, if you want a quick way to "remodel" your domain object, this module might meet your needs.  Let me know what you think.  I think I'm going to try using this with ical-generator.

Thursday, April 3, 2014

I'm a Web Magnate

Well, one can dream.  :-)

I've spent the past few months developing a web site / app using a bunch of new (to me) stuff: node.js, Express, mongoDB, along with HTML5, CSS3, and a little jQuery for a cool map using jVectorMap.  All with JavaScript of course.  I learned a lot along the way, and now feel somewhat competent with most of the technology.  It's even hooked into Google Maps, Google Analytics, and AddThis for social networking.

Check it out here at nextq.info.  What is it?  It is a search service to find upcoming dog agility trials.  My main hobby.

For hosting, I looked into a few cloud services such as Heroku and Nodejitsu, but decided to use Modulus.io.  So far I've been extremely pleased.
  • You get one month free, thereafter the "typical" $15 a month.
  • Their dashboard is cool-looking and, more importantly, fairly useful.
  • Creating and hooking up a small Mongo Database is simple, even a newbie like me can do it.
  • Uploading your code is simple, even a newbie like me can do it.
  • Linking their site to my domain name server was fairly straightforward, even a... (you get the idea).
  • On a crash it will send you an email (easy to configure) or a text, then restart your app.
There were a couple of minor gotchas.  I forgot to route both nextq.info and www.nextq.info.  It took me a little while to learn about the .modulusignore file that tells the upload to ignore unnecessary or test files.  (Though it is smart enough to ignore your node_modules folder.)

So far so good.  I especially enjoyed making up agility-centric custom 404 (and 400 and 500) pages.  I'm finding that jVectorMap is a little flaky on some devices, so there is also a much simpler, somewhat "mobile friendly" search page.  But the main parts seem to work fine and some of my agility friends are already using it.

Sunday, January 19, 2014

Yet Another Example Demo Node.js / Express Project

I'm working on an app that finds upcoming events and locations, and wanted a quick link to bring up the weather for each event's city, which takes a little work.  What I came up with makes a very nice little node.js / express "example" or "demo" program.
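To give the flavor of the kind of thing involved, here's a much-simplified sketch (not the actual code from the repo; WEATHER_URL here is just a placeholder):

// Simplified sketch, not the real code: redirect /weather/SomeCity to a
// weather site's search page for that city. WEATHER_URL is a placeholder.
var WEATHER_URL = 'http://www.example-weather-site.com/search?q=';

app.get('/weather/:city', function (req, res) {
   res.redirect(WEATHER_URL + encodeURIComponent(req.params.city));
});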

Thought I'd try it as a Google Docs presentation.

Source code is on GitHub.

I'll likely add some more commentary here in the near future.