Implementing UDFs in PIG

Steps to write a UDF and use it in PIG

Step 1: In Eclipse, create a new class with an appropriate name.

Step 2: Add the necessary Hadoop and Pig JARs to the Eclipse project to resolve dependencies (EvalFunc and Tuple come from the Pig JAR).
To do so in Eclipse:
Right Click on project –> Build Path –> Configure Build Path –> Libraries –> Add External Jars –> Select all the necessary JAR files from the Hadoop Libs folder –> Click Ok.

Step 3: Your class should extend EvalFunc, parameterized with the return type of the UDF, as shown in the example
Eg: public class CalculateAge extends EvalFunc<Long> {

Step 4: Write the business logic for the UDF in the exec() method

Step 5: Once your code is ready, export it to a .jar file in Eclipse.
To do so in Eclipse:
File –> Export –> Java –> JAR file –> select the project to be jarred and provide a valid name for the JAR file –> Finish

Step 6: Place the JAR file where your PIG script can access it.

Step 7: Now that the JAR file is ready, register it in your PIG script using the REGISTER operator
REGISTER /com/customeUDFs/CalculateAge.jar;

Step 8: Define an alias for the UDF so the script does not have to repeat the full path to the UDF
This is done using the DEFINE operator.
Eg: DEFINE calculateAge CalculateAge();

Step 9: Use the UDF in the PIG script
Eg:
/*
APPLY CUSTOM UDF ON THE DOB FIELD TO CALCULATE AGE
*/
udf_usage = FOREACH load_data
                GENERATE
                    id,
                    name,
                    dob,
                    calculateAge(dob) AS age:long;
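
The business logic from Step 4 can be kept in a plain helper method that exec() delegates to, so it can be tested without a Pig installation. The following is a minimal sketch under stated assumptions, not the method used by the sample UDF in this post: the class name AgeLogic, the method ageInYears, and the explicit "today" parameter are hypothetical names introduced here, and the dd-MM-yyyy pattern is assumed to match the input data. It uses only the JDK's java.time classes.

```java
import java.time.LocalDate;
import java.time.Period;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;

// Hypothetical helper class; not part of Pig or of the sample UDF.
class AgeLogic {

    // Assumed input pattern (day-month-year); adjust it to match the data.
    private static final DateTimeFormatter DOB_FORMAT =
            DateTimeFormatter.ofPattern("dd-MM-yyyy");

    // Returns the completed years between dob and today,
    // or null when dob is missing or cannot be parsed.
    static Long ageInYears(String dob, LocalDate today) {
        if (dob == null || dob.trim().isEmpty()) {
            return null;
        }
        try {
            LocalDate birth = LocalDate.parse(dob.trim(), DOB_FORMAT);
            return (long) Period.between(birth, today).getYears();
        } catch (DateTimeParseException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        // "today" is passed in so the result does not depend on the system clock.
        LocalDate today = LocalDate.of(2024, 6, 15);
        System.out.println(ageInYears("16-06-1990", today)); // birthday not yet reached this year
        System.out.println(ageInYears("15-06-1990", today)); // birthday falls on "today"
        System.out.println(ageInYears("not-a-date", today)); // unparseable input
    }
}
```

One design choice worth noting: this sketch returns null for missing or malformed dates so that an unknown date of birth is not reported as age 0. Whether null or 0 is the right fallback depends on how the downstream script treats those records.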

Note: A sample script that uses the UDF and a sample UDF in Java are given below

Pig Script implementing UDF

/*
Problem Statement: Implementing a custom UDF to calculate Age from the given Date Of Birth

Features Explained: REGISTER EXTERNAL UDF JAR FILE
                    DEFINE ALIAS FOR THE CUSTOM UDF
                    IMPLEMENTING THE UDF IN THE SCRIPT 
*/

/*
REGISTER THE JAR FILE OF YOUR UDF BEFORE USING IT IN THE SCRIPT
*/
REGISTER /com/sample/udfs/sample.jar;

/*
FOR BETTER READABILITY DEFINE AN ALIAS OF THE UDF; WHICH CAN BE USED IN THE ENTIRE FLOW OF YOUR SCRIPT
*/
DEFINE calculateAge CalculateAge();

/*
LOAD DATA FROM INPUT FILE
*/
load_data = LOAD '/com/input_src/input_file.csv' USING PigStorage(',') 
                AS (id: int,
                    name:chararray,
                    dob:chararray,
                    score:int,
                    city:chararray
                   );
/*
   APPLY CUSTOM UDF ON THE DOB FIELD TO CALCULATE AGE
*/
udf_usage = FOREACH load_data 
                GENERATE 
                    id,
                    name,
                    dob,
                    calculateAge(dob) AS age:long;

STORE udf_usage INTO '/com/output/dir/' USING PigStorage(',') ;

UDF in Java

import java.io.IOException;
import java.text.DateFormat;
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

public class CalculateAge extends EvalFunc<Long> {

    @Override
    public Long exec(Tuple input) throws IOException {
        // Pig passes the UDF arguments wrapped in a Tuple; dob is the first field
        if (input == null || input.size() == 0 || input.get(0) == null) {
            return 0L;
        }

        String dob = input.get(0).toString();
        try {
            DateFormat dt = new SimpleDateFormat("dd-MM-yyyy");
            Date dateOfBirth = dt.parse(dob);
            Date today = new Date();

            // Approximate age: elapsed hours divided by 8760 (hours in a 365-day year)
            long diff = today.getTime() - dateOfBirth.getTime();
            long diffHours = diff / (60 * 60 * 1000);
            long diffYears = diffHours / 8760;

            return diffYears;
        } catch (ParseException e) {
            // Surface malformed dates as an IOException, which exec() is allowed to throw
            throw new IOException("Could not parse date of birth: " + dob, e);
        }
    }
}
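
The UDF above estimates age by dividing elapsed hours by 8760, the number of hours in a 365-day year. Because leap days are not accounted for, the estimate can run one year ahead shortly before a birthday. The following worked example is a sketch of that arithmetic only: the class name ApproxAgeDemo and the method approxYears are hypothetical, and it counts whole days with java.time instead of the millisecond difference the UDF uses, so that the result does not depend on the time of day or the time zone.

```java
import java.time.LocalDate;
import java.time.temporal.ChronoUnit;

// Hypothetical demo class reproducing the UDF's arithmetic on fixed dates.
class ApproxAgeDemo {

    // 365 days * 24 hours, the divisor used in the UDF
    static final long HOURS_PER_YEAR = 8760;

    // Elapsed whole days -> hours -> years, using integer division as the UDF does.
    static long approxYears(LocalDate birth, LocalDate today) {
        long days = ChronoUnit.DAYS.between(birth, today);
        long hours = days * 24;
        return hours / HOURS_PER_YEAR;
    }

    public static void main(String[] args) {
        LocalDate birth = LocalDate.of(2000, 1, 1);
        LocalDate today = LocalDate.of(2019, 12, 31);

        // 7304 days * 24 = 175296 hours; 175296 / 8760 = 20 (integer division)
        System.out.println(approxYears(birth, today)); // prints 20
    }
}
```

In this example the person turns 20 on 01-01-2020, so the calendar age on 31-12-2019 is still 19, while the hour-based estimate already reports 20. If exact ages matter, a calendar-based calculation such as java.time.Period is one alternative to consider.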
