Predicting User Ratings on Yelp

Several weeks ago, Yelp launched a dataset challenge and released data on their users and businesses in Phoenix, AZ. The challenge shared some similarities with the Netflix Prize, and I was curious to see if the Alternating Least Squares algorithm that I used for my Netflix project was versatile enough to be used for predicting business ratings on Yelp. The results were disappointing, but I shall document the process here for my own reference. All of the code used is available in a git repository.

Transforming the Data

The dataset arrived in a tarball that unpacks into a directory of JSON files. Each JSON file could be imported into MongoDB relatively easily with the following command:

$> mongoimport --db yelp --collection reviews --drop --file <filename>

However, since Yelp obfuscated the primary keys before exporting the dataset, we need to map them to consecutive integers before we can import the data into MATLAB as a sparse matrix. This can be done using a script that is available from the git repository.

Once the keys are remapped, the reviews collection can be exported to a CSV file and imported into MATLAB.
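The remapping step can be illustrated with a short Python sketch. This is not the script from the repository; it only shows the idea, and the field names (user_id, business_id, stars) and the sample keys are assumptions made for the example.

```python
import csv
import io

def remap_ids(reviews):
    """Replace opaque user/business keys with consecutive integers starting at 1."""
    user_ids, business_ids = {}, {}
    rows = []
    for review in reviews:
        # setdefault hands out the next unused integer the first time a key is seen.
        u = user_ids.setdefault(review["user_id"], len(user_ids) + 1)
        b = business_ids.setdefault(review["business_id"], len(business_ids) + 1)
        rows.append((u, b, review["stars"]))
    return rows, user_ids, business_ids

def to_csv(rows):
    """Serialize (user, business, stars) triples as CSV text."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()

# Hypothetical sample records; real keys are long opaque strings.
sample = [
    {"user_id": "xQ9", "business_id": "aB3", "stars": 4},
    {"user_id": "kL2", "business_id": "aB3", "stars": 2},
    {"user_id": "xQ9", "business_id": "zT7", "stars": 5},
]
rows, users, businesses = remap_ids(sample)
print(to_csv(rows))
```

Numbering starts at 1 because MATLAB indexes from 1. Triples in this form can be turned into a sparse matrix there with spconvert.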

Decomposing the Matrix

Using a latent factor model, we assume that each business has a set of features, and each user has a certain preference for each feature. The rating r_ij that user j gives to business i is then the dot product of the business's feature vector b_i and the user's preference vector u_j. Putting everything together, we get B × U = R, where B is the features matrix (one row b_i per business), U is the preferences matrix (one column u_j per user), and R is the ratings matrix.

The goal of the alternating least squares algorithm is thus to determine the matrices B and U, and then use these for predicting unknown ratings. (The full explanation of the model and the algorithm is available in my project report.)

Note that the model does not know what these features represent. They are latent, and only their number is chosen in advance.
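To make the alternating idea concrete, here is a minimal pure-Python sketch of regularized ALS. It is not the MATLAB code from the project: the function names, the tiny ratings dictionary, and the plain (unweighted) lambda penalty are all assumptions made for illustration. Each half-step fixes one side and solves a small ridge regression for every row of the other side.

```python
import random

def solve(a, b):
    """Solve the square system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][c] * x[c] for c in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def update(fixed, ratings_by_row, k, lam):
    """For each row, solve (lam*I + sum v v^T) x = sum r*v over its known ratings."""
    out = {}
    for row, pairs in ratings_by_row.items():
        a = [[lam if p == q else 0.0 for q in range(k)] for p in range(k)]
        b = [0.0] * k
        for other, r in pairs:
            v = fixed[other]
            for p in range(k):
                b[p] += r * v[p]
                for q in range(k):
                    a[p][q] += v[p] * v[q]
        out[row] = solve(a, b)
    return out

def als(ratings, k=2, lam=0.1, iters=20, seed=0):
    """ratings maps (business, user) to a rating; returns factor dicts B and U."""
    rng = random.Random(seed)
    by_biz, by_user = {}, {}
    for (i, j), r in ratings.items():
        by_biz.setdefault(i, []).append((j, r))
        by_user.setdefault(j, []).append((i, r))
    U = {j: [rng.random() for _ in range(k)] for j in by_user}
    B = {}
    for _ in range(iters):
        B = update(U, by_biz, k, lam)   # fix users, solve for businesses
        U = update(B, by_user, k, lam)  # fix businesses, solve for users
    return B, U

def predict(B, U, i, j):
    """Predicted rating: dot product of business i's and user j's vectors."""
    return sum(x * y for x, y in zip(B[i], U[j]))

def objective(ratings, B, U, lam):
    """Squared error on known ratings plus the lambda penalty on all factors."""
    err = sum((r - predict(B, U, i, j)) ** 2 for (i, j), r in ratings.items())
    reg = sum(x * x for v in list(B.values()) + list(U.values()) for x in v)
    return err + lam * reg

# Hypothetical toy data: 3 businesses, 3 users, 6 known ratings.
ratings = {(0, 0): 5.0, (0, 1): 3.0, (1, 0): 4.0, (1, 2): 1.0, (2, 1): 2.0, (2, 2): 5.0}
B, U = als(ratings)
print(round(predict(B, U, 0, 2), 2))  # a rating that was not in the input
```

Because each half-step exactly minimizes the objective over one block of variables, the objective never increases from one iteration to the next, which makes it a handy sanity check when hunting for bugs.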

Measuring the Results

The entire ratings matrix R was too large for my MacBook to handle, so I took a random sample of 2000 users and 2000 businesses, which gave me about 1780 known ratings. The model has two parameters: the number of latent factors, and the regularization factor (lambda). I fixed the number of latent factors at 20, and then used 5-fold cross validation to determine the best value for lambda.
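The cross-validation loop can be sketched as follows. The helper names and the rmse_for callback (which would train on one part and return the RMSE on the held-out part) are hypothetical; only the 5-fold structure comes from the description above.

```python
import random

def k_fold_splits(items, k=5, seed=0):
    """Yield (train, test) pairs so that every item is held out exactly once."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]
    for held_out in range(k):
        test = folds[held_out]
        train = [x for i, fold in enumerate(folds) if i != held_out for x in fold]
        yield train, test

def pick_lambda(ratings, lambdas, rmse_for, k=5):
    """Return the lambda with the lowest mean held-out RMSE.

    rmse_for(train, test, lam) is a caller-supplied (hypothetical) function
    that fits the model on train and returns the RMSE on test.
    """
    def mean_rmse(lam):
        scores = [rmse_for(train, test, lam) for train, test in k_fold_splits(ratings, k)]
        return sum(scores) / len(scores)
    return min(lambdas, key=mean_rmse)
```

Using the same seed for every candidate lambda keeps the folds identical, so the candidates are compared on the same splits.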

The results were as follows:

Algorithm            RMSE
Plain ALS            1.4688
ALS w. Bias          1.2365
ALS w. Correction    1.5045

Possible Improvements

The results were disappointing, given that the standard deviation of my sample was only 1.2077. In other words, ALS did worse than simply predicting the mean rating all the time. Other than fixing possible bugs in my code (maybe I somehow permuted the data?), I could potentially modify the algorithm as follows:

  • Take into account known features and estimate user preferences for these.
  • Use a clustering algorithm on the locations of the businesses, since a user's preference for a business is likely to be affected by its location.
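The comparison against the standard deviation works because the RMSE of always predicting the mean is exactly the (population) standard deviation of the ratings. A small check, using made-up ratings rather than the Yelp sample:

```python
import math
import statistics

def rmse(predictions, actual):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predictions, actual)) / len(actual))

# Hypothetical ratings, only to demonstrate the identity.
ratings = [5, 4, 4, 3, 1, 5, 2, 4]
mean = statistics.mean(ratings)
baseline = rmse([mean] * len(ratings), ratings)
print(math.isclose(baseline, statistics.pstdev(ratings)))  # True
```

Any model whose RMSE exceeds this baseline is doing worse than a constant prediction, which is the case for all three ALS variants in the table.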

I will post updates here if I figure out what is going wrong.


Deployment is Scary - Part 1

As part of a class project, I was given the opportunity to field a Rails application at Spring Carnival, an annual event held by Carnegie Mellon University (see the official press release). Here, I will describe a subset of the risk mitigation strategies that we used to guard against attacks and maximize uptime.

In the first part of this series, I will focus on our use of JMeter for some simple load testing.

Amazon Elastic Beanstalk

The application was deployed on Amazon's Elastic Beanstalk, which has auto-scaling built in. However, this also meant that the application could have several web servers working off a single RDS server, which could cause some performance issues. Our goals for load testing were two-fold: first, we wished to ascertain the effectiveness of our auto-scaling policies; second, we wanted to ensure that the response times were reasonable even at twice the maximum expected load.

JMeter

JMeter was not the simplest tool for load testing, but developers (and ops engineers) seemed to agree that it was amongst the most powerful. The user manual provides detailed instructions for building a web test plan. In addition to a thread group and a config element for HTTP request defaults, we also added an HTTP Cookie Manager for maintaining sessions. The following screenshot captures the hierarchy of our setup.

[Screenshot: hierarchy of the JMeter test plan]

Extract CSRF Token

Before our test user could log in, it had to find the CSRF token on the page and submit it along with its credentials for Rails to accept the request. To do this, we added an HTTP Request sampler to our thread group. This sampler issues GET requests to /users/sign_in and, using a Regular Expression Extractor (available under Post Processors), searches for the CSRF token with the following regex:

input name="authenticity_token" type="hidden" value="(.*?)"

The matched token is stored in the LOGIN_AUTH_TOKEN variable for JMeter to use in subsequent requests.
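To see what the extractor does, the same regex can be tried outside JMeter. The Python sketch below runs it against a hand-written HTML fragment; the fragment and the token value are made up for the example.

```python
import re

# Same pattern as the Regular Expression Extractor above.
TOKEN_PATTERN = re.compile(r'input name="authenticity_token" type="hidden" value="(.*?)"')

def extract_csrf_token(html):
    """Return the first captured token, or None if the page has no match."""
    match = TOKEN_PATTERN.search(html)
    return match.group(1) if match else None

# Hypothetical sign-in page fragment.
page = '<form><input name="authenticity_token" type="hidden" value="q1w2+e3/r4=" /></form>'
print(extract_csrf_token(page))  # q1w2+e3/r4=
```

Note that the pattern depends on the exact attribute order in the rendered HTML, so it may need adjusting if the form markup changes.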

Authenticate User

Next, we added another request sampler that issues POST requests to /users/sign_in with the user's email, password, and the CSRF token from the previous step. (I forgot to URL-encode the CSRF token and spent a great amount of time wondering why Rails occasionally rejected requests from this sampler.) The following screenshot shows our configuration.

[Screenshot: configuration of the sign-in request sampler]
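A plausible explanation for the occasional rejections is that the token is Base64 text, which sometimes contains a + character, and an unencoded + in a form body is decoded as a space. The Python sketch below demonstrates the effect with a made-up token; it illustrates the encoding issue only and is not part of the JMeter plan.

```python
from urllib.parse import parse_qs, urlencode

# Hypothetical token containing characters that are special in form bodies.
token = "q1w2+e3/r4="

raw_body = "authenticity_token=" + token                 # token pasted in as-is
encoded_body = urlencode({"authenticity_token": token})  # token percent-encoded

# What a server would decode from each body:
print(parse_qs(raw_body)["authenticity_token"][0])      # q1w2 e3/r4=
print(parse_qs(encoded_body)["authenticity_token"][0])  # q1w2+e3/r4=
```

Tokens without a + survive the round trip unencoded, which would explain why the failures appeared only some of the time.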

Do Something

Finally, we added request samplers to perform some action on the application. In my case, I was mainly worried about I/O performance, and thus added actions that would force reads and writes to the database. Note that some of these actions may require CSRF tokens, in which case one could use the CSRF extraction described above and an HTTP Header Manager to include the X-CSRF-Token header in the request. We also found it useful to include a response assertion to verify the status code of the response.

Measure Response Times

After some tinkering (such as using eager loading to reduce the number of database calls), we managed to improve the application's performance to (what I thought was) a reasonable level.

The entire JMX file is available here.



Numbering the Last Line of Align* in LaTeX

On numerous occasions, I find myself trying to number the last line (and only the last line!) of a long series of derivations in an align* environment. I usually switch to the align environment and use \notag to turn off the number on each line except the last.

There is a better solution, offered by egreg on StackOverflow:

By using equation and aligned, we get an environment similar to align* that only numbers the last line.
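A minimal sketch of this pattern, assuming the amsmath package is loaded. The derivation itself is a placeholder, and the [b] option (which aligns the block on its bottom row so that the equation number sits next to the last line) is my assumption about how the effect is achieved; without it the number is vertically centered on the block.

```latex
\documentclass{article}
\usepackage{amsmath}
\begin{document}
\begin{equation}
\begin{aligned}[b]
  (a + b)^2 &= (a + b)(a + b) \\
            &= a^2 + ab + ba + b^2 \\
            &= a^2 + 2ab + b^2
\end{aligned}
\end{equation}
\end{document}
```

Unlike align, this produces a single equation number for the whole block, so no \notag is needed on the earlier lines.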


Some Custom Matchers for RSpec with Shoulda

It should have a valid factory

I use Factory Girl in my tests, and I have found it useful to check that I have valid factories for each model in my tests/specs. Here is a matcher you can use for that purpose.

Add this file to spec/support. Then in your specs, just use

describe User do
  it { should have_a_valid_factory }
end

It should validate the existence of X

In some of my models, I have validations that ensure the existence of a given foreign key, i.e. the foreign key must refer to an existing record. You can use the validates_existence gem, or write your own. Here is a matcher for validates_existence_of.

In your specs, just use

describe Membership do
  it { should validate_existence_of :user }
end

Deploying a Thrift Server with Capistrano

At the time of writing, I couldn't find an existing Capistrano recipe for deploying Thrift servers. So here is an example that works with Ubuntu.

Before you use the recipe, don't forget to update the configuration options to match your server setup. At the very least, you will need to update the following:

  • :application
  • :repository
  • :user
  • :super_user
  • :app_port
  • role :app

Note that the recipe assumes a system-wide rvm install. :user does not need to be a sudoer, but :super_user must be. This is arranged as such for security purposes, since the server process will be owned by :user. The super user is used to create the Upstart job.

Finally, execute

$ cap deploy:setup
$ cap deploy

to set up and deploy your Thrift server.

Upstart

During cap deploy:setup, the recipe creates an Upstart job using the template at config/thrift_app.conf. This allows Upstart to launch and monitor your Thrift server. You can use other monitoring tools if you like; just update the deploy:setup task accordingly.

Thor

To play with the Thrift server, launch it as follows:

$ bundle exec bin/my-application server

Then create a new client as follows:

$ bundle exec bin/my-application client
> ping
pong!

Monopoly Heat Map

I am working on a Monopoly manual for my professional writing class, and made a heat map showing the relative probabilities of landing on each square of the Monopoly board. It is drawn using the HTML5 canvas element.

Here is a preview:

[Image: Monopoly heat map]

Check it out here.
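For readers curious how such probabilities can be estimated, here is a rough Monte Carlo sketch in Python. It is not the code behind the heat map, and it makes strong simplifying assumptions: only two dice and the Go To Jail square are modelled, while Chance and Community Chest cards, the doubles rule, and time spent in jail are ignored.

```python
import random

BOARD_SIZE = 40
JAIL, GO_TO_JAIL = 10, 30  # square indices, counting Go as 0

def landing_probabilities(rolls=200_000, seed=0):
    """Estimate the fraction of turns that end on each square."""
    rng = random.Random(seed)
    counts = [0] * BOARD_SIZE
    position = 0
    for _ in range(rolls):
        position = (position + rng.randint(1, 6) + rng.randint(1, 6)) % BOARD_SIZE
        if position == GO_TO_JAIL:
            position = JAIL  # the turn ends in jail instead
        counts[position] += 1
    return [c / rolls for c in counts]

probabilities = landing_probabilities()
hottest = max(range(BOARD_SIZE), key=probabilities.__getitem__)
print(hottest, round(probabilities[hottest], 3))
```

Even this crude model shows the jail square standing out, since it collects its own landings plus everything redirected from Go To Jail.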


Passwordless Kerberized SSH for CMU's UNIX machines

If you use SSH frequently to access unix.andrew.cmu.edu, you must have tried at some point to set up passwordless login using public/private keys. That didn't work for me, and when I did ssh -vv unix.andrew.cmu.edu, it always failed at

debug2: we did not send a packet, disable method

and then asked me for a password. After much trial and error (including locking myself out of my account), I finally uncovered the secret.

Note: this tutorial is for Mac OS X only, although you might be able to configure a similar setup on Linux.

Set Up a Few Things

First, let's get a few basic things out of the way. These are optional, but I will assume that you have these for the rest of the tutorial.

  1. Download and install Mac OS X Kerberos Extras from MIT.
  2. Set up a default SSH user for the unix.andrew.cmu.edu domain, by ensuring that there is a file at ~/.ssh/config that contains the following:

~/.ssh/config

Host unix.andrew.cmu.edu
    User your_andrew_ID

Configure Kerberos

Instead of using public/private keys for passwordless login, we will use Kerberos tickets.

First, copy the configuration file from the UNIX machines to your local machine.

$> scp unix.andrew.cmu.edu:/etc/krb5.conf krb5.conf
$> sudo mv krb5.conf /etc/krb5.conf

Next, try to obtain a Kerberos ticket for your Andrew account, then try to log in.

$> kinit my_andrew_id@ANDREW.CMU.EDU
# enter your password when prompted ...
$> ssh unix.andrew.cmu.edu

The last step should succeed without asking you for your password.

Renew Daily

Once you have completed the above configuration, at the start of each day, issue the following command:

$> kinit my_andrew_id@ANDREW.CMU.EDU
# enter your password when prompted ...

And you should be able to ssh into unix.andrew.cmu.edu for the rest of the day without a password. The tickets expire after 11 hours.

Store Password in Keychain Access

If you like, you can keep your password in Keychain Access by using the following command:

$> security add-generic-password -a "my_andrew_id" \
    -s "ANDREW.CMU.EDU" \
    -w "mypasswd" \
    -c "aapl" \
    -T "/usr/bin/kinit"

Entering passwords in plain text on the command line may scare some people. You can leave the -w option out and enter the password manually in the Keychain Access app.

Once that is done, kinit will get your password from Keychain Access automatically.

Done!

Now you can git pull and git push like a pro without that pesky password prompt.

Advanced: Auto-Renewal

There might be a way for you to automatically renew Kerberos tickets when they expire. It's not too annoying for me to enter my password once a day, so I will leave that to you.


Backbone, Faye, and Reaction

I was jealous of what MeteorJS did for NodeJS developers, and decided to steal some of their ideas for Rails. The result is the reaction gem, and a demo app can be downloaded from here.

Transparent Synchronization

Using Reaction, you write your app like a usual Backbone app. However, instead of using Backbone.Collection, use Reaction.Collection to create a synchronized collection.

var collection = new Reaction.Collection({
  controller_name: 'todos',
  model_name: 'todo'
});

The collection provides a custom sync function that routes changes to the conventional CRUD endpoints of the Rails application. For example, when you invoke collection.create({...}, {wait:true}), it makes a POST request to the Rails app.

When updates are pushed from the Rails server to the client, the collection is updated and add, remove, or change events are fired. For example, a view may react to an add event as follows:

collection.bind('add', this.addOne, this);
// ...
addOne: function(obj) {
  $('x').append(new XView(obj).render().el);
}

Local HTML5 Storage

Objects in the collection are cached in HTML5 local storage. If the client is online, this helps to reduce the initial time required to load the interface; if the client is offline, this allows the collection to be re-created from the cache, showing the user a decent (but outdated) interface.

Separate App Server and Push Server

For scaling purposes, the app server can be separated from the push server. This also allows multiple app servers to share the same push server. When these servers are started, a shared secret must be provided. This secret will be used to sign messages sent from the app servers to the push server.
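The general idea of signing with a shared secret can be sketched with an HMAC. The Python example below is only an illustration of the concept; the envelope layout, function names, and choice of SHA-256 are assumptions, and the gem's actual scheme may differ.

```python
import hashlib
import hmac
import json

def sign(message, secret):
    """Wrap a message with an HMAC-SHA256 signature computed from the shared secret."""
    payload = json.dumps(message, sort_keys=True)
    signature = hmac.new(secret.encode("utf-8"), payload.encode("utf-8"), hashlib.sha256).hexdigest()
    return {"payload": payload, "signature": signature}

def verify(envelope, secret):
    """Recompute the signature and compare it in constant time."""
    expected = hmac.new(secret.encode("utf-8"), envelope["payload"].encode("utf-8"), hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, envelope["signature"])

# Hypothetical message from an app server to the push server.
envelope = sign({"channel": "todos", "event": "add", "id": 7}, "shared-secret")
print(verify(envelope, "shared-secret"))  # True
```

The push server can then drop any message whose signature does not verify, so only servers that know the secret can publish updates.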

If you wish to learn more, head to the quickstart.


Hello world

This is my first blog post using Pelican.

I first tried using WordPress, then migrated to Tumblr. Hope this works better! Old posts are still accessible at Tumblr. This one is starting anew.
