Common Pig One Liners

As with any programming language, there is a bit of a learning curve with Pig. So here are a few common items that I found useful. If you know Pig, please feel free to add your own in the comments section.

When it comes to Pig, there is a “filter early, filter often” approach that is preached and practiced, so some of these run to more than one line, but either way they are short. They have only been tested on Pig 0.6, on Amazon’s Elastic MapReduce. Since they are simple, they should be fairly portable. As one would expect, these are contrived examples.

  • Count all the items in a bag. The SQL equivalent is: SELECT COUNT(*) FROM foo.
    -- Assuming that 'visits' contains all visits to your website (for example)
    -- Returns: (100L)
    total_visits = FOREACH (GROUP visits ALL) GENERATE COUNT($1);
  • Grouping on multiple elements in a bag. Assume you have a bag with 4 tuples that looks like this: (1,Football),(2,Soccer),(1,Soccer),(2,Soccer). You may want to know how many of user type 1 are “Football” or “Soccer” and how many of user type 2 are “Football” or “Soccer”. Note: If you want user_type and sport kept together as a single nested tuple instead of two top-level fields, just replace FLATTEN($0) with $0.
    -- Group by user_type and then by sports interest
    -- Returns: (1,Football,1L),(1,Soccer,1L),(2,Soccer,2L)
    --          {group::user_type: chararray,group::sport: chararray,total: long}
    sports_interests_by_user_type = FOREACH (GROUP user_type BY ((CHARARRAY)$0, (CHARARRAY)$1)) GENERATE FLATTEN($0), COUNT($1) AS total;
  • Add a field to every element in a bag. From my understanding, this next bit is a Pig 0.6ism. Joining both relations on the constant 1 gives every tuple the same join key, so each tuple in sports_interests_by_user_type gets paired with the single tuple in total_visits. The effect is similar to an array push of a field onto the end of every tuple in a bag.
    -- Add total visits to every tuple in sports_interests_by_user_type
    -- Returns: (2,Soccer,2L,100L)
    --          {sports_interests_by_user_type::group: chararray,sports_interests_by_user_type::total: long,long}
    sports_interests_by_user_type_fraction = JOIN sports_interests_by_user_type BY 1, total_visits BY 1;
  • Let’s take the field that we added to the end of each tuple and get a percentage out of it. This returns each count as a percentage of the total (that is, out of 100).
    -- Divide the number of user_types per interest by the total
    -- Returns: (2,Soccer,2.0F), i.e. 2 out of 100 visits
    --          {sports_interests_by_user_type::group::wv: chararray,sports_interests_by_user_type::group::area: chararray,float}
    sports_interests_by_user_type_percent = FOREACH sports_interests_by_user_type_fraction GENERATE (CHARARRAY)$0, (CHARARRAY)$1, (FLOAT)(((FLOAT)$2 / (FLOAT)$3) * 100);
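
To tie the four steps together, here is a rough end-to-end sketch run against the 4-tuple sample from above. Treat it as an untested illustration rather than something I ran on the cluster: the file name visits.tsv, the field names user_type and sport, and the relation names are all made up for this example, and it counts and groups the same relation for simplicity. It also swaps in CROSS for the JOIN ... BY 1 trick, since CROSS is the more conventional way to pair every tuple with a single-row relation.

```pig
-- Hypothetical input file 'visits.tsv', tab-separated, one visit per line:
--   1    Football
--   2    Soccer
--   1    Soccer
--   2    Soccer
visits = LOAD 'visits.tsv' USING PigStorage('\t')
         AS (user_type:chararray, sport:chararray);

-- Step 1: total number of visits (a relation with one tuple and one field)
visits_all   = GROUP visits ALL;
total_visits = FOREACH visits_all GENERATE COUNT(visits) AS total;

-- Step 2: number of visits per (user_type, sport) pair
visits_by_type_sport = GROUP visits BY (user_type, sport);
counts = FOREACH visits_by_type_sport
         GENERATE FLATTEN(group), COUNT(visits) AS total;

-- Step 3: attach the single total to every tuple.
-- CROSS is used here in place of the JOIN ... BY 1 approach shown above.
counts_with_total = CROSS counts, total_visits;

-- Step 4: each count as a percentage of all visits.
-- Positional references avoid depending on the prefixed field names.
percent = FOREACH counts_with_total
          GENERATE $0, $1, ((FLOAT)$2 / (FLOAT)$3) * 100;

DUMP percent;
```

With 4 visits in total, the arithmetic works out to 25 percent for (1,Football), 25 percent for (1,Soccer) and 50 percent for (2,Soccer). The order in which DUMP prints the tuples is not guaranteed.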

This post is another example of work that I could not have accomplished without the help of people on #hadoop-pig. Also worthy of note are the Pig Latin manuals.
