Steps to write and use a UDF in Pig
Step 1: In Eclipse, create a new class with an appropriate name.
Step 2: Add the necessary Hadoop and Pig JARs to the project's build path to resolve dependencies (EvalFunc and Tuple come from the Pig JAR).
To do so in Eclipse:
Right Click on project –> Build Path –> Configure Build Path –> Libraries –> Add External Jars –> Select all the necessary JAR files from the Hadoop Libs folder –> Click Ok.
Step 3: Your class should extend EvalFunc, parameterized with the type the UDF returns, as in the sample UDF
Eg: public class CalculateAge extends EvalFunc<Long> {
Step 4: Write the business logic for the UDF in the exec() method
Step 5: Once your code is ready, export it to a .jar file in Eclipse.
To do so in Eclipse:
File –> Export –> Java –> JAR file –> select the project to be jarred and provide a valid name for the JAR file –> Finish
Step 6: Place the JAR file where it is accessible to your Pig script.
Step 7: Now that the JAR file is ready, register it in your Pig script using the REGISTER statement
REGISTER /com/customeUDFs/CalculateAge.jar;
Step 8: Define an alias for the UDF instead of using the entire path to UDF in the script
This is done using the DEFINE operator.
>> Eg: DEFINE
Step 9: Use the UDF in the PIG script
Eg: /*
APPLY CUSTOM UDF ON THE DOB FIELD TO CALCULATE AGE
*/
udf_usage = FOREACH load_data
GENERATE
id,
name,
dob,
calculateAge(dob) AS age:long;
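Step 4 leaves the business logic open. Because dob reaches the UDF as a chararray, part of that logic is deciding what to do with missing or malformed values. The following is a minimal sketch in plain Java, with no Pig classes, so it can be tried outside a Pig job. The class name DobParser and its parse method are hypothetical helpers introduced here for illustration, and the dd-MM-yyyy pattern is taken from the sample UDF.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;

// Hypothetical helper (not part of Pig): validates a date-of-birth string.
final class DobParser {
    private DobParser() { }

    // Returns the parsed date, or null when the value is missing or malformed.
    static Date parse(String dob) {
        if (dob == null || dob.trim().isEmpty()) {
            return null;
        }
        SimpleDateFormat format = new SimpleDateFormat("dd-MM-yyyy");
        // SimpleDateFormat is lenient by default and would roll "31-02-2000"
        // forward into March; strict mode rejects such values instead.
        format.setLenient(false);
        try {
            return format.parse(dob.trim());
        } catch (ParseException e) {
            return null;
        }
    }
}
```

An exec() method could call DobParser.parse(...) on the first field of its input tuple and return null when the helper returns null, so that a bad record produces an empty age rather than failing the job. Whether to skip, default, or fail on bad input is a design choice that depends on the data.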
Note: A sample script that uses the UDF and a sample UDF in Java are given below.
Pig Script implementing UDF
/*
Problem Statement: Implementing a custom UDF to calculate Age from the given Date Of Birth
Features Explained:
  REGISTER EXTERNAL UDF JAR FILE
  DEFINE ALIAS FOR THE CUSTOM UDF
  IMPLEMENTING THE UDF IN THE SCRIPT
*/

/* REGISTER THE JAR FILE OF YOUR UDF BEFORE USING IT IN THE SCRIPT */
REGISTER /com/sample/udfs/sample.jar;

/* FOR BETTER READABILITY DEFINE AN ALIAS OF THE UDF, WHICH CAN BE USED IN THE ENTIRE FLOW OF YOUR SCRIPT.
   THE ALIAS MUST MATCH THE NAME USED IN THE GENERATE CLAUSE BELOW; CalculateAge IS THE UDF CLASS (NO PACKAGE). */
DEFINE calculateAge CalculateAge();

/* LOAD DATA FROM INPUT FILE */
load_data = LOAD '/com/input_src/input_file.csv' USING PigStorage(',')
    AS (id:int, name:chararray, dob:chararray, score:int, city:chararray);

/* APPLY CUSTOM UDF ON THE DOB FIELD TO CALCULATE AGE */
udf_usage = FOREACH load_data
    GENERATE
        id,
        name,
        dob,
        calculateAge(dob) AS age:long;

STORE udf_usage INTO '/com/output/dir/' USING PigStorage(',');
UDF in Java
import java.io.IOException;
import java.text.DateFormat;
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

public class CalculateAge extends EvalFunc<Long> {

    @Override
    public Long exec(Tuple input) throws IOException {
        // Put the business logic of the UDF here.
        // Returns 0 when there is no date of birth to work with.
        if (input == null || input.size() == 0 || input.get(0) == null) {
            return 0L;
        }
        try {
            // The date of birth is the first field of the tuple, e.g. "15-08-1990".
            DateFormat dt = new SimpleDateFormat("dd-MM-yyyy");
            Date dateOfBirth = dt.parse(input.get(0).toString());
            Date today = new Date();
            long diff = today.getTime() - dateOfBirth.getTime(); // milliseconds
            long diffHours = diff / (60 * 60 * 1000);
            // Approximate: treats every year as 365 days (8760 hours).
            long diffYears = diffHours / 8760;
            return diffYears;
        } catch (ParseException e) {
            throw new IOException("Could not parse date of birth: " + input.get(0), e);
        }
    }
}
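The sample UDF derives the age by dividing elapsed hours by 8760, which ignores leap days and can therefore report a birthday a few days early. As a possible alternative, the sketch below computes the age from calendar dates with java.time (Java 8 or later). It is plain Java with no Pig dependency; the class name AgeHelper and the method ageInYears are hypothetical, and the reference date is passed in as a parameter only so that the result is reproducible.

```java
import java.time.LocalDate;
import java.time.Period;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;

// Hypothetical helper (not part of Pig): calendar-based age calculation.
final class AgeHelper {
    // Same pattern as the sample UDF: day-month-year, e.g. "15-08-1990".
    private static final DateTimeFormatter DOB_FORMAT =
            DateTimeFormatter.ofPattern("dd-MM-yyyy");

    private AgeHelper() { }

    // Returns completed years between dob and today, or null when dob is
    // missing, unparseable, or later than today.
    static Long ageInYears(String dob, LocalDate today) {
        if (dob == null) {
            return null;
        }
        try {
            LocalDate birth = LocalDate.parse(dob.trim(), DOB_FORMAT);
            if (birth.isAfter(today)) {
                return null;
            }
            // Period counts whole calendar years, so leap days are handled.
            return (long) Period.between(birth, today).getYears();
        } catch (DateTimeParseException e) {
            return null;
        }
    }
}
```

Inside exec(), such a helper could be called as AgeHelper.ageInYears(input.get(0).toString(), LocalDate.now()) after the null checks on the tuple. With a date of birth of 15-08-1990, it gives 29 on 14 August 2020 and 30 on 15 August 2020.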