Pig-cookbook
Table of contents
1 Overview
2 Performance Enhancers
Copyright © 2007 The Apache Software Foundation. All rights reserved.
Pig Cookbook
1. Overview
This document provides hints and tips for Pig users.
2. Performance Enhancers
2.1. Use Optimization
Pig supports various optimization rules which are turned on by default. Become familiar with these rules.
2.2. Use Types
If types are not specified in the load statement, Pig assumes the type double for numeric computations. Often your data fits in a smaller type, such as int or long. Specifying the actual type speeds up arithmetic computation. It has the additional advantage of early error detection.
-- Query 1
A = load 'myfile' as (t, u, v);
B = foreach A generate t + u;

-- Query 2
A = load 'myfile' as (t: int, u: int, v);
B = foreach A generate t + u;
The second query will run more efficiently than the first. In some of our queries we see a 2x speedup.
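As a further sketch of the same idea, each field can be given its own type, and the describe operator can be used to check the schema Pig has recorded. The types chosen here (int, long, chararray) are assumptions about what a hypothetical 'myfile' contains, not something the queries above require.

```pig
-- Assumption: 'myfile' has three columns; t fits in an int,
-- u needs a long, and v is text. Adjust to match the real data.
A = load 'myfile' as (t: int, u: long, v: chararray);

-- Show the schema Pig has recorded for A.
describe A;

-- Arithmetic now runs on the declared types rather than on double.
B = foreach A generate t + u;
describe B;
```

Running describe after each step is a cheap way to confirm that the declared types are carried through the query.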
2.3. Project Early and Often
Pig does not (yet) determine when a field is no longer needed and drop the field from the row. For example, say you have a query like:
A = load 'myfile' as (t, u, v);
B = load 'myotherfile' as (x, y, z);
C = join A by t, B by x;
D = group C by u;
E = foreach D generate group, COUNT($1);
There is no need for v, y, or z to participate in this query, and there is no need to carry both t and x past the join; one of them will suffice. Changing the query above to the query below greatly reduces the amount of data that Pig carries through the map and reduce phases.
A = load 'myfile' as (t, u, v);
A1 = foreach A generate t, u;
B = load