
How to make my Postgres Query faster using Apache Spark?




How can I minimize my query execution time using PySpark?



I am using a Postgres database, with Spark installed on my local machine, which has 10 GB of RAM.



Query execution time in pgAdmin: 10 seconds



Query execution time in PySpark: 10 seconds



Below is my PySpark code:



from pyspark.sql import DataFrameReader

url = "jdbc:postgresql://168.23.233.4:5432/MyDatabase"
properties = {
    "driver": "org.postgresql.Driver",
    "user": "postgres",
    "password": "123"
}

# sqlContext is the SQLContext object predefined in the pyspark shell
df = sqlContext.read.jdbc(url=url, table="(select.. very big query limit 10) AS t", properties=properties)
df.show()


The query has to join more than 13 tables, and each table has 1 million rows.



Please help me make this query faster using Spark.



I tried this approach based on a blog post.



Below is the query that runs inside the PySpark code:



select '2019-02-27' as "Attendance_date",e.id as e_id,concat(e.first_name::text, e.last_name::text) as "Employee_name",e.emp_id as "Employee_id",
e.user_id as "User_id",e.customer_id,att.id attendance_id, al.id as Attendance_logs_id,aa1.id as attendance_approval_id,
e.client_emp_id as "Client_employee_id", e.contact_no as "Contact_no",
att.imei as ImeiNumber,e.email_id as "Email_id",
concat(man.first_name::text, man.last_name::text) as "Manager_name", man.id as "Manager_id",
att.Uniform,att.Samsung_Logo,att.Blue_Color_Check,att.Blue_Color_Percentage,
al.Face_Detection_Flag,rl.role_name as "Role_name",b.branch_name as "Branch_name",
b.branch_code as "Branch_code",cty.city_name as "City",sm.state as "State",
gsv1.name as "Geo_Country",gsv2.name as "Geo_State"
,sh.shift_name as "Shift_name",sh.id as shift_id
,((to_timestamp(EXTRACT(EPOCH FROM al.check_in_time::TIME) + ((tz.operator||''||tz.difference)::INTEGER))::TIME AT TIME ZONE 'utc')::TIME)
as "Check_in_time"
,al.check_in_lat as "Check_in_latitude", al.check_in_long as "Check_in_longitude",
(select string_agg(value, ', ') from json_each_text(al.check_in_address::json))as "Check_in_address",
att.check_in_late as "Check_in_late_remarks",al.check_in_distance_variation as "Check_in_distance",al.check_in_selfie as "Check_in_selfie",
case when @aa1.approval_flag = 2 then ch_in.attendance_reason end as "Check_in_rejection_remarks",qc_ch_in.attendance_reason
as "Check_in_qc_review",
case when @att.regularize_flag = 1 or @att.approval_flag = 1 THEN 'Approved'
when @aa2.approval_flag = 1 THEN 'Approved' when @aa2.approval_flag = 2 THEN 'Rejected' when @aa2.approval_flag = 0
THEN 'Pending' else null END as "Check_out_status",
case when att.attendance_type='P' and @att.approval_flag = 1 or att.attendance_type='L' and el.approval_flag=1 or
att.attendance_type='H' and eh.approval_flag=1 or att.attendance_type='M' and em.approval_flag=1 or
att.attendance_type='W' and ew.approval_flag=1 then 'Approved'
when att.attendance_type='P' and @att.approval_flag = 0 or att.attendance_type='L' and el.approval_flag=0 or
att.attendance_type='H' and eh.approval_flag=0 or att.attendance_type='M' and em.approval_flag=0 or
att.attendance_type='W' and ew.approval_flag=0 then 'Waiting for Approval'
when att.attendance_type='P' and el.approval_flag=2 or att.attendance_type='H' and eh.approval_flag=2 or
att.attendance_type='M' and em.approval_flag=2 or att.attendance_type='W' and ew.approval_flag=2 then 'Rejected'
when att.attendance_type='P' and @att.approval_flag is null then '' else 'Waiting for Approval' end as "TL approval status",
case when att.attendance_type='P' then 'Marked'
when att.attendance_type='L' then 'Marked' when att.attendance_type='HL' or att.attendance_type='HP' then 'Marked'
when att.attendance_type='W' or ewo.weekoff_id is not null then 'Marked' when (e.customer_id is null and cehv.id is not null)
or (e.customer_id is not null and ehv.id is not null) then 'Holiday' when el.employee_id is not null then 'Marked'
when eh.employee_id is not null then 'Marked' when ew.employee_id is not null then 'Marked'
when em.employee_id is not null then 'Marked' else 'Not Marked' end "Status",
case when att.attendance_type='P'
then check_in.attendance_reason when att.attendance_type='L' then lt.leave_type_name when el.employee_id is not null
then lt.leave_type_name when att.attendance_type='HL' or att.attendance_type='HP' then 'Half Day'
when att.attendance_type='W' or ewo.weekoff_id is not null then 'Week off' when (e.customer_id is null and cehv.id is not null
) or (e.customer_id is not null and ehv.id is not null) then 'Holiday' when eh.employee_id is not null then 'Holiday' when
ew.employee_id is not null then 'Week off' when em.employee_id is not null then 'Marketoff' else 'Absent'
end as "Attendance_reason",
case when att.on_behalf_attendance is not null then concat(man_behalf.first_name::text,
man_behalf.last_name::text) else null end as "Onbehalf_name",att.Check_Out_Qc_Review,att.Check_Out_Distance,
al.Check_Out_Address
from employees e
left join employee_applied_holidays eh on eh.employee_id=e.id and date('2019-02-27') between eh.from_date and eh.to_date
left join employee_applied_weekoffs ew on ew.employee_id=e.id and date('2019-02-27') between ew.from_date and ew.to_date
left join employee_applied_marketoffs em on em.employee_id=e.id and date('2019-02-27') between em.from_date and em.to_date
inner join users u on u.ref_id = e.id and u.customer_id=200
inner join user_role_groups urg on u.id = urg.user_id and urg.active_flag = 1
inner join attendance_setups ass on ass.role_group_id = urg.role_group_id
left join attendances att on att.employee_id = e.id and att.start_date = '2019-02-27' and att.delete_flag = 0

left join employee_leaves el ON el.id=(select id from employee_leaves el2 where el2.employee_id=e.id and
el2.active_flag=1 and date('2019-02-27') between el2.from_date and el2.to_date order by id desc limit 1)

left join leave_types lt ON lt.id=(select leave_type from employee_leaves el where el.employee_id=e.id and
el.active_flag=1 and date('2019-02-27') between el.from_date and el.to_date order by id desc limit 1)

left join attendance_logs al on al.attendance_id = att.id and al.attendance_flag = 1
left join attendance_approvals aa1 on al.id = aa1.attendance_log_id and aa1.action = 1 and aa1.active_flag = 1
left join attendance_approvals aa2 on al.id = aa2.attendance_log_id and aa2.action = 2 and aa2.active_flag = 1
inner join branches b on b.id = e.branch_id left join employees man on man.id = e.manager_id
left join employees man_behalf on man_behalf.id = att.on_behalf_attendance
left join employee_weekoff ewo on e.id = ewo.emp_id and date_part('dow','2019-02-27'::TIMESTAMP)+1 = ewo.weekoff_id and
ewo.active_flag =1 left join employee_holidays_view ehv on e.id = ehv.id and ehv.holiday_date = '2019-02-27'
left join company_employee_holidays_view cehv on e.id = cehv.id and ehv.holiday_date = '2019-02-27'
inner join roles rl on rl.id = e.role_id inner join cities cty on cty.id = b.city_id
inner join states on states.id = b.state_id inner join state_master sm on sm.id = states.state_id
inner join countries on countries.id = b.country_id inner join country_master cm on cm.country_id = countries.country_id
left join shifts sh on sh.id = att.shift_id left join attendance_reasons ch_in on ch_in.id = aa1.reason_id
left join sessions se on sh.id=se.shift_id left join attendance_reasons ch_out on ch_out.id = aa2.reason_id
left join attendance_reasons qc_ch_in on qc_ch_in.id = att.check_in_qc_review
left join attendance_reasons qc_ch_out on qc_ch_out.id = att.check_out_qc_review
left join attendance_reasons check_in on check_in.id = al.reason_id
left join time_zones tz on b.timezone = tz.time_zone inner join geo_outlet_mapping gom
on b.id = gom.outlet_id left join geo_structure_values gsv1 on gsv1.id = gom.level1 left join
geo_structure_values gsv2 on gsv2.id = gom.level2 left join geo_structure_values gsv3 on gsv3.id = gom.level3
where e.customer_id=200

group by concat(e.first_name::text, e.last_name::text) ,e.emp_id ,e.user_id ,e.client_emp_id , e.contact_no , e.email_id,e.profile_picture,(select string_agg(role_group_name, ', ') from role_group where role_group_id = any((select array_agg(role_group_id) from user_role_groups where user_id = u.id and active_flag = 1)::int[])),concat(man.first_name::text, man.last_name::text), rl.role_name,b.branch_name,b.branch_code,cty.city_name,sm.state,cm.country,gsv1.name,gsv2.name,case when @ass.reference_point = 1 THEN b.latitude else e.latitude END,case when @ass.reference_point = 1 THEN b.longitude else e.longitude END,sh.shift_name,sh.start_time, sh.end_time,((to_timestamp(EXTRACT(EPOCH FROM al.check_in_time::TIME) + ((tz.operator||''||tz.difference)::INTEGER))::TIME AT TIME ZONE 'utc')::TIME),case
when current_date='2019-02-27' and sh.end_time<cast(current_time as time without time zone) then null else (case when se.check_out_flag=1 then cast(att.total_hours as interval) when se.check_out_flag=0 then sh.end_time-((to_timestamp(EXTRACT(EPOCH FROM al.check_in_time::TIME) + ((tz.operator||''||tz.difference)::INTEGER))::TIME AT TIME ZONE 'utc')::TIME) end) end,al.check_in_lat, al.check_in_long,(select string_agg(value, ', ') from json_each_text(al.check_in_address::json)),att.check_in_late,al.check_in_distance_variation,al.check_in_selfie,case when @att.regularize_flag = 1 or @att.approval_flag = 1 THEN 'Approved' when @aa1.approval_flag = 1 THEN 'Approved' when @aa1.approval_flag = 2 THEN 'Rejected' when @aa1.approval_flag = 0 THEN 'Pending' else null END,case when @aa1.approval_flag = 2 then ch_in.attendance_reason end,qc_ch_in.attendance_reason,case when @att.regularize_flag = 1 or @att.approval_flag = 1 THEN 'Approved' when @aa2.approval_flag = 1 THEN 'Approved' when @aa2.approval_flag = 2 THEN 'Rejected' when @aa2.approval_flag = 0 THEN 'Pending' else null END,
case when att.attendance_type='P' and @att.approval_flag = 1 or att.attendance_type='L' and el.approval_flag=1 or
att.attendance_type='H' and eh.approval_flag=1 or att.attendance_type='M' and em.approval_flag=1 or
att.attendance_type='W' and ew.approval_flag=1 then 'Approved'
when att.attendance_type='P' and @att.approval_flag = 0 or att.attendance_type='L' and el.approval_flag=0 or
att.attendance_type='H' and eh.approval_flag=0 or att.attendance_type='M' and em.approval_flag=0 or
att.attendance_type='W' and ew.approval_flag=0 then 'Waiting for Approval'
when att.attendance_type='P' and el.approval_flag=2 or att.attendance_type='H' and eh.approval_flag=2 or
att.attendance_type='M' and em.approval_flag=2 or att.attendance_type='W' and ew.approval_flag=2 then 'Rejected'
when att.attendance_type='P' and @att.approval_flag is null then '' else 'Waiting for Approval' end,
case when att.attendance_type='P' then 'Marked' when att.attendance_type='L' then 'Marked'
when att.attendance_type='HL' or att.attendance_type='HP' then 'Marked' when att.attendance_type='W'
or ewo.weekoff_id is not null then 'Marked' when (e.customer_id is null and cehv.id is not null) or (e.customer_id is not null and ehv.id is not null) then 'Holiday' when el.employee_id is not null then 'Marked' when eh.employee_id is not null then 'Marked' when ew.employee_id is not null then 'Marked' when em.employee_id is not null then 'Marked' else 'Not Marked' end,case when att.attendance_type='P' then check_in.attendance_reason when att.attendance_type='L' then lt.leave_type_name when el.employee_id is not null then lt.leave_type_name when att.attendance_type='HL' or att.attendance_type='HP' then 'Half Day' when att.attendance_type='W' or ewo.weekoff_id is not null then 'Week off' when (e.customer_id is null and cehv.id is not null) or (e.customer_id is not null and ehv.id is not null) then 'Holiday' when eh.employee_id is not null then 'Holiday' when ew.employee_id is not null then 'Week off' when em.employee_id is not null then 'Marketoff' else 'Absent' end ,case when att.on_behalf_attendance is not null then concat(man_behalf.first_name::text,man_behalf.last_name::text) else null end
,att.id,e.customer_id,al.id,aa1.id,att.imei,man.id,att.Uniform,att.Samsung_Logo,att.Blue_Color_Check,att.Blue_Color_Percentage,
al.Face_Detection_Flag,att.Check_Out_Qc_Review,att.Check_Out_Distance,sh.id,att.id,e.id;









  • I would recommend doing two things: 1. Run a simple join between any two tables. 2. Run the same join using PySpark, after loading each table into a separate RDD and using the join() function. Then check whether there is a difference in execution time.

    – Jim Todd
    Mar 7 at 17:50
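The experiment suggested in the comment above could be sketched roughly as follows. This is only an illustration under stated assumptions: it uses DataFrames rather than raw RDDs, the helper names (`jdbc_table_options`, `timed`, `join_in_spark`) are hypothetical, and the choice of the `employees` and `branches` tables with the `branch_id = id` join condition is taken from the query in the question.

```python
import time


def jdbc_table_options(url, table, user, password, driver="org.postgresql.Driver"):
    """Build the option dict for reading one table through Spark's JDBC source."""
    return {
        "url": url,
        "dbtable": table,
        "user": user,
        "password": password,
        "driver": driver,
    }


def timed(fn):
    """Run fn() and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn()
    return result, time.perf_counter() - start


def join_in_spark(spark, url, user, password):
    """Load two tables as separate DataFrames and join them inside Spark.

    `spark` is assumed to be an existing SparkSession (or SQLContext).
    """
    employees = spark.read.format("jdbc").options(
        **jdbc_table_options(url, "employees", user, password)).load()
    branches = spark.read.format("jdbc").options(
        **jdbc_table_options(url, "branches", user, password)).load()
    joined = employees.join(branches, employees["branch_id"] == branches["id"], "inner")
    # count() is an action: it forces Spark to read both tables and run the join.
    return joined.count()


# Usage against a live database (not executed here):
# rows, seconds = timed(lambda: join_in_spark(sqlContext, url, "postgres", "123"))
# print(rows, seconds)
```

Comparing the measured time with the same two-table join run directly in pgAdmin shows whether moving the join into Spark changes anything on a single machine.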















ewo.active_flag =1 left join employee_holidays_view ehv on e.id = ehv.id and ehv.holiday_date = '2019-02-27'
left join company_employee_holidays_view cehv on e.id = cehv.id and ehv.holiday_date = '2019-02-27'
inner join roles rl on rl.id = e.role_id inner join cities cty on cty.id = b.city_id
inner join states on states.id = b.state_id inner join state_master sm on sm.id = states.state_id
inner join countries on countries.id = b.country_id inner join country_master cm on cm.country_id = countries.country_id
left join shifts sh on sh.id = att.shift_id left join attendance_reasons ch_in on ch_in.id = aa1.reason_id
left join sessions se on sh.id=se.shift_id left join attendance_reasons ch_out on ch_out.id = aa2.reason_id
left join attendance_reasons qc_ch_in on qc_ch_in.id = att.check_in_qc_review
left join attendance_reasons qc_ch_out on qc_ch_out.id = att.check_out_qc_review
left join attendance_reasons check_in on check_in.id = al.reason_id
left join time_zones tz on b.timezone = tz.time_zone inner join geo_outlet_mapping gom
on b.id = gom.outlet_id left join geo_structure_values gsv1 on gsv1.id = gom.level1 left join
geo_structure_values gsv2 on gsv2.id = gom.level2 left join geo_structure_values gsv3 on gsv3.id = gom.level3
where e.customer_id=200

group by concat(e.first_name::text, e.last_name::text) ,e.emp_id ,e.user_id ,e.client_emp_id , e.contact_no , e.email_id,e.profile_picture,(select string_agg(role_group_name, ', ') from role_group where role_group_id = any((select array_agg(role_group_id) from user_role_groups where user_id = u.id and active_flag = 1)::int[])),concat(man.first_name::text, man.last_name::text), rl.role_name,b.branch_name,b.branch_code,cty.city_name,sm.state,cm.country,gsv1.name,gsv2.name,case when @ass.reference_point = 1 THEN b.latitude else e.latitude END,case when @ass.reference_point = 1 THEN b.longitude else e.longitude END,sh.shift_name,sh.start_time, sh.end_time,((to_timestamp(EXTRACT(EPOCH FROM al.check_in_time::TIME) + ((tz.operator||''||tz.difference)::INTEGER))::TIME AT TIME ZONE 'utc')::TIME),case
when current_date='2019-02-27' and sh.end_time<cast(current_time as time without time zone) then null else (case when se.check_out_flag=1 then cast(att.total_hours as interval) when se.check_out_flag=0 then sh.end_time-((to_timestamp(EXTRACT(EPOCH FROM al.check_in_time::TIME) + ((tz.operator||''||tz.difference)::INTEGER))::TIME AT TIME ZONE 'utc')::TIME) end) end,al.check_in_lat, al.check_in_long,(select string_agg(value, ', ') from json_each_text(al.check_in_address::json)),att.check_in_late,al.check_in_distance_variation,al.check_in_selfie,case when @att.regularize_flag = 1 or @att.approval_flag = 1 THEN 'Approved' when @aa1.approval_flag = 1 THEN 'Approved' when @aa1.approval_flag = 2 THEN 'Rejected' when @aa1.approval_flag = 0 THEN 'Pending' else null END,case when @aa1.approval_flag = 2 then ch_in.attendance_reason end,qc_ch_in.attendance_reason,case when @att.regularize_flag = 1 or @att.approval_flag = 1 THEN 'Approved' when @aa2.approval_flag = 1 THEN 'Approved' when @aa2.approval_flag = 2 THEN 'Rejected' when @aa2.approval_flag = 0 THEN 'Pending' else null END,
case when att.attendance_type='P' and @att.approval_flag = 1 or att.attendance_type='L' and el.approval_flag=1 or
att.attendance_type='H' and eh.approval_flag=1 or att.attendance_type='M' and em.approval_flag=1 or
att.attendance_type='W' and ew.approval_flag=1 then 'Approved'
when att.attendance_type='P' and @att.approval_flag = 0 or att.attendance_type='L' and el.approval_flag=0 or
att.attendance_type='H' and eh.approval_flag=0 or att.attendance_type='M' and em.approval_flag=0 or
att.attendance_type='W' and ew.approval_flag=0 then 'Waiting for Approval'
when att.attendance_type='P' and el.approval_flag=2 or att.attendance_type='H' and eh.approval_flag=2 or
att.attendance_type='M' and em.approval_flag=2 or att.attendance_type='W' and ew.approval_flag=2 then 'Rejected'
when att.attendance_type='P' and @att.approval_flag is null then '' else 'Waiting for Approval' end,
case when att.attendance_type='P' then 'Marked' when att.attendance_type='L' then 'Marked'
when att.attendance_type='HL' or att.attendance_type='HP' then 'Marked' when att.attendance_type='W'
or ewo.weekoff_id is not null then 'Marked' when (e.customer_id is null and cehv.id is not null) or (e.customer_id is not null and ehv.id is not null) then 'Holiday' when el.employee_id is not null then 'Marked' when eh.employee_id is not null then 'Marked' when ew.employee_id is not null then 'Marked' when em.employee_id is not null then 'Marked' else 'Not Marked' end,case when att.attendance_type='P' then check_in.attendance_reason when att.attendance_type='L' then lt.leave_type_name when el.employee_id is not null then lt.leave_type_name when att.attendance_type='HL' or att.attendance_type='HP' then 'Half Day' when att.attendance_type='W' or ewo.weekoff_id is not null then 'Week off' when (e.customer_id is null and cehv.id is not null) or (e.customer_id is not null and ehv.id is not null) then 'Holiday' when eh.employee_id is not null then 'Holiday' when ew.employee_id is not null then 'Week off' when em.employee_id is not null then 'Marketoff' else 'Absent' end ,case when att.on_behalf_attendance is not null then concat(man_behalf.first_name::text,man_behalf.last_name::text) else null end
,att.id,e.customer_id,al.id,aa1.id,att.imei,man.id,att.Uniform,att.Samsung_Logo,att.Blue_Color_Check,att.Blue_Color_Percentage,
al.Face_Detection_Flag,att.Check_Out_Qc_Review,att.Check_Out_Distance,sh.id,att.id,e.id;









postgresql apache-spark pyspark apache-spark-sql pyspark-sql






asked Mar 7 at 4:45 by Srinivasan E (edited Mar 7 at 12:06)

  • I would recommend two things. 1. Run a simple join between any two tables in Postgres. 2. Run the same join in PySpark, after loading each table into a separate RDD and using the join() function. Then check whether there is a difference in execution time.

    – Jim Todd
    Mar 7 at 17:50
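
A comparison like the one suggested above needs some way to time both variants consistently. The following is only a minimal sketch: `best_time` is a hypothetical helper name, and what you pass in (a function that runs the join in Postgres, and one that runs it in PySpark) is up to you.

```python
import time


def best_time(fn, repeats=3):
    """Call fn() `repeats` times; return (last result, fastest wall-clock seconds)."""
    best = float("inf")
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = fn()
        # Keep the fastest run to reduce noise from caching and warm-up.
        best = min(best, time.perf_counter() - start)
    return result, best
```

When timing the Spark side, wrap an action such as `count()` on the joined DataFrame. Transformations like `join()` are lazy and return almost immediately, so timing them alone would be misleading.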

















1 Answer
-1














As far as I know, in this case the query takes the same time in PySpark and in pgAdmin because in both cases it is executed entirely by the Postgres database.

At this point you have not yet used Spark's distributed computing and storage. You have only created a DataFrame from the result of a SQL query that Postgres ran. Only the operations you perform on that DataFrame afterwards can show a difference in speed.

So the optimization has to happen on the Postgres side. The following points will help:

  1. Optimize your SQL so that it runs faster.

  2. Read chunks of the tables (simple SQL statements) into Spark, and consider doing the actions/transformations in PySpark to get the desired result,
    instead of one SQL statement with complex joins.
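
Point 2 could be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the asker's actual setup. It assumes an existing SparkSession is passed in as `spark`, and that `id` in both tables is a numeric key spanning roughly 1 to 1,000,000 (the question mentions about a million rows per table). The helper names `partitioning`, `read_table` and `employees_with_attendance` are made up for illustration. The table and column names come from the query in the question. Passing `column`, `lowerBound`, `upperBound` and `numPartitions` to `DataFrameReader.jdbc` makes Spark issue several range queries in parallel instead of one.

```python
from datetime import date


def partitioning(column, lower, upper, num_partitions):
    """Keyword arguments that make DataFrameReader.jdbc split one read
    into `num_partitions` range queries on a numeric column."""
    if num_partitions < 1 or lower >= upper:
        raise ValueError("need num_partitions >= 1 and lower < upper")
    return {
        "column": column,
        "lowerBound": lower,   # bounds only set the stride, they do not filter rows
        "upperBound": upper,
        "numPartitions": num_partitions,
    }


def read_table(spark, url, table, properties, split=None):
    """Read a table (or a simple aliased subquery) as a Spark DataFrame."""
    return spark.read.jdbc(url=url, table=table, properties=properties,
                           **(split or {}))


def employees_with_attendance(spark, url, properties, customer_id, day):
    """Push only simple filters down to Postgres and do the join in Spark."""
    emp_sql = "(select * from employees where customer_id = %d) AS e" % int(customer_id)
    att_sql = ("(select * from attendances where start_date = '%s' "
               "and delete_flag = 0) AS att" % day.isoformat())
    split = partitioning("id", 1, 1_000_000, 8)  # assumed id range and partition count
    e = read_table(spark, url, emp_sql, properties, split)
    att = read_table(spark, url, att_sql, properties, split)
    # Both inputs have an `id` column; rename one before selecting it downstream.
    return e.join(att, e["id"] == att["employee_id"], "left")
```

Whether this beats the single SQL statement depends on the data, the indexes and the cluster. On one local machine it may well be slower, so measure both.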





  • This is just hand-waving. Unless the query and its execution plan are known, nobody can say what the problem is. What I object to is the assumption that doing the work in the application will be faster than doing it in SQL. While there are certainly cases where this is true, my personal experience says that it is usually the other way around.

    – Laurenz Albe
    Mar 7 at 7:02
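
One way to obtain the execution plan mentioned above is PostgreSQL's `EXPLAIN (ANALYZE, BUFFERS)`. The sketch below only builds the statement and reads the result through a generic DB-API cursor. The function names are hypothetical, and the cursor is assumed to come from whatever Postgres driver you already use. Note that `ANALYZE` really executes the query, so it takes about as long as the query itself.

```python
def explain_statement(query, analyze=True, buffers=True):
    """Wrap a query in EXPLAIN so PostgreSQL returns its plan instead of rows."""
    options = []
    if analyze:
        options.append("ANALYZE")      # runs the query and reports real timings
        if buffers:
            options.append("BUFFERS")  # adds buffer hit/read counts
    body = query.strip().rstrip(";")
    prefix = ("EXPLAIN (%s) " % ", ".join(options)) if options else "EXPLAIN "
    return prefix + body


def fetch_plan(cursor, query, **options):
    """Run EXPLAIN through a DB-API cursor and return the plan as one string."""
    cursor.execute(explain_statement(query, **options))
    # PostgreSQL returns the text plan as one single-column row per line.
    return "\n".join(row[0] for row in cursor.fetchall())
```

Posting the resulting plan together with the query gives readers something concrete to analyze.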











  • A (complex) SQL query on top of a relational database is comparatively slower than a PySpark operation that uses distributed computing (multiple nodes). It is like one person counting all the pages of a book versus 10 people each counting an equal chunk and then adding up the results. Isn't that faster? That's the reasoning behind the 2nd point in my answer.

    – Jim Todd
    Mar 7 at 9:35
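
The page-counting analogy has the following shape in code. This is a toy sketch with hypothetical function names, in which threads stand in for the 10 people. It shows only the split, count and aggregate pattern, and says nothing about whether Spark would beat Postgres for a particular query, since splitting and combining have their own cost.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(items, workers):
    """Cut `items` into at most `workers` contiguous chunks of near-equal size."""
    size = max(1, -(-len(items) // workers))  # ceiling division
    return [items[i:i + size] for i in range(0, len(items), size)]


def count_pages(pages, workers=10):
    """Give each worker one chunk to count, then sum the partial counts."""
    chunks = split_into_chunks(pages, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(len, chunks))
```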






  • You are potentially right, if the workload can be conveniently parallelized in the application in a way that PostgreSQL cannot do better with its internal parallel query. A big "if", since we know nothing about the query in question.

    – Laurenz Albe
    Mar 7 at 11:00











  • The "if" could be trivial here, because people use Spark mainly to handle huge data and reduce query/processing time. The underlying principle is distributed storage and computation; "RDD" stands for Resilient Distributed Dataset. So it would be the parallelism in Spark that makes the difference.

    – Jim Todd
    Mar 7 at 14:27











  • Point taken. Let's see if and what OP answers.

    – Laurenz Albe
    Mar 7 at 15:37










answered Mar 7 at 5:18









Jim Todd








  • This is just hand-waving. Unless the query and its execution plan are known, nobody can say what the problem is. What I object to is the assumption that doing the work in the application will be faster than doing it in SQL. While there are certainly cases where this is true, my personal experience says that it is usually the other way around.

    – Laurenz Albe
    Mar 7 at 7:02











  • A complex SQL query on a relational database is comparatively slower than a PySpark operation that uses distributed computing across multiple nodes. It is like one person counting the pages of a book versus 10 people counting equal chunks of it and adding up the results. Isn't that faster? That is the reasoning behind the second point in my answer.

    – Jim Todd
    Mar 7 at 9:35






  • You are potentially right, if the workload can conveniently be parallelized in the application in a way that PostgreSQL cannot do better with its internal parallel query. A big "if", since we know nothing about the query in question.

    – Laurenz Albe
    Mar 7 at 11:00











  • The "if" could be trivial here, because people mostly use Spark to handle huge data sets and to reduce query and processing time. The underlying principle is distributed storage and computation; "RDD" stands for Resilient Distributed Dataset. So it is the parallelism in Spark that makes the difference.

    – Jim Todd
    Mar 7 at 14:27











  • Point taken. Let's see if and what OP answers.

    – Laurenz Albe
    Mar 7 at 15:37






























