How to make my Postgres Query faster using Apache Spark?
How can I reduce my query execution time using PySpark?
I am using a Postgres database, and Spark is installed on my local machine, which has 10 GB of RAM.
Query execution time in pgAdmin: 10 sec
Query execution time in PySpark: 10 sec
My PySpark code is below:
from pyspark.sql import DataFrameReader
url = "jdbc:postgresql://168.23.233.4:5432/MyDatabase"
# Connection properties passed to the JDBC driver
properties = {
    "driver": "org.postgresql.Driver",
    "user": "postgres",
    "password": "123"
}
# The parenthesised subquery is sent to Postgres as-is; Spark reads only its result
df = sqlContext.read.jdbc(url=url, table="(select.. very big query limit 10) AS t", properties=properties)
df.show()
The query joins more than 13 tables, and each table has 1 million rows.
Please help me make this query faster using Spark.
I tried this approach based on a blog post.
The query that runs inside the PySpark code is below:
select '2019-02-27' as "Attendance_date",e.id as e_id,concat(e.first_name::text, e.last_name::text) as "Employee_name",e.emp_id as "Employee_id",
e.user_id as "User_id",e.customer_id,att.id attendance_id, al.id as Attendance_logs_id,aa1.id as attendance_approval_id,
e.client_emp_id as "Client_employee_id", e.contact_no as "Contact_no",
att.imei as ImeiNumber,e.email_id as "Email_id",
concat(man.first_name::text, man.last_name::text) as "Manager_name", man.id as "Manager_id",
att.Uniform,att.Samsung_Logo,att.Blue_Color_Check,att.Blue_Color_Percentage,
al.Face_Detection_Flag,rl.role_name as "Role_name",b.branch_name as "Branch_name",
b.branch_code as "Branch_code",cty.city_name as "City",sm.state as "State",
gsv1.name as "Geo_Country",gsv2.name as "Geo_State"
,sh.shift_name as "Shift_name",sh.id as shift_id
,((to_timestamp(EXTRACT(EPOCH FROM al.check_in_time::TIME) + ((tz.operator||''||tz.difference)::INTEGER))::TIME AT TIME ZONE 'utc')::TIME)
as "Check_in_time"
,al.check_in_lat as "Check_in_latitude", al.check_in_long as "Check_in_longitude",
(select string_agg(value, ', ') from json_each_text(al.check_in_address::json))as "Check_in_address",
att.check_in_late as "Check_in_late_remarks",al.check_in_distance_variation as "Check_in_distance",al.check_in_selfie as "Check_in_selfie",
case when @aa1.approval_flag = 2 then ch_in.attendance_reason end as "Check_in_rejection_remarks",qc_ch_in.attendance_reason
as "Check_in_qc_review",
case when @att.regularize_flag = 1 or @att.approval_flag = 1 THEN 'Approved'
when @aa2.approval_flag = 1 THEN 'Approved' when @aa2.approval_flag = 2 THEN 'Rejected' when @aa2.approval_flag = 0
THEN 'Pending' else null END as "Check_out_status",
case when att.attendance_type='P' and @att.approval_flag = 1 or att.attendance_type='L' and el.approval_flag=1 or
att.attendance_type='H' and eh.approval_flag=1 or att.attendance_type='M' and em.approval_flag=1 or
att.attendance_type='W' and ew.approval_flag=1 then 'Approved'
when att.attendance_type='P' and @att.approval_flag = 0 or att.attendance_type='L' and el.approval_flag=0 or
att.attendance_type='H' and eh.approval_flag=0 or att.attendance_type='M' and em.approval_flag=0 or
att.attendance_type='W' and ew.approval_flag=0 then 'Waiting for Approval'
when att.attendance_type='P' and el.approval_flag=2 or att.attendance_type='H' and eh.approval_flag=2 or
att.attendance_type='M' and em.approval_flag=2 or att.attendance_type='W' and ew.approval_flag=2 then 'Rejected'
when att.attendance_type='P' and @att.approval_flag is null then '' else 'Waiting for Approval' end as "TL approval status",
case when att.attendance_type='P' then 'Marked'
when att.attendance_type='L' then 'Marked' when att.attendance_type='HL' or att.attendance_type='HP' then 'Marked'
when att.attendance_type='W' or ewo.weekoff_id is not null then 'Marked' when (e.customer_id is null and cehv.id is not null)
or (e.customer_id is not null and ehv.id is not null) then 'Holiday'when el.employee_id is not null then 'Marked'
when eh.employee_id is not null then 'Marked' when ew.employee_id is not null then 'Marked'
when em.employee_id is not null then 'Marked' else 'Not Marked' end "Status",
case when att.attendance_type='P'
then check_in.attendance_reason when att.attendance_type='L' then lt.leave_type_name when el.employee_id is not null
then lt.leave_type_name when att.attendance_type='HL' or att.attendance_type='HP' then 'Half Day'
when att.attendance_type='W' or ewo.weekoff_id is not null then 'Week off' when (e.customer_id is null and cehv.id is not null
) or (e.customer_id is not null and ehv.id is not null) then 'Holiday' when eh.employee_id is not null then 'Holiday' when
ew.employee_id is not null then 'Week off' when em.employee_id is not null then 'Marketoff' else 'Absent'
end as "Attendance_reason",
case when att.on_behalf_attendance is not null then concat(man_behalf.first_name::text,
man_behalf.last_name::text) else null end as "Onbehalf_name",att.Check_Out_Qc_Review,att.Check_Out_Distance,
al.Check_Out_Address
from employees e
left join employee_applied_holidays eh on eh.employee_id=e.id and date('2019-02-27') between eh.from_date and eh.to_date
left join employee_applied_weekoffs ew on ew.employee_id=e.id and date('2019-02-27') between ew.from_date and ew.to_date
left join employee_applied_marketoffs em on em.employee_id=e.id and date('2019-02-27') between em.from_date and em.to_date
inner join users u on u.ref_id = e.id and u.customer_id=200
inner join user_role_groups urg on u.id = urg.user_id and urg.active_flag = 1
inner join attendance_setups ass on ass.role_group_id = urg.role_group_id
left join attendances att on att.employee_id = e.id and att.start_date = '2019-02-27' and att.delete_flag = 0
left join employee_leaves el ON el.id=(select id from employee_leaves el2 where el2.employee_id=e.id and
el2.active_flag=1 and date('2019-02-27') between el2.from_date and el2.to_date order by id desc limit 1)
left join leave_types lt ON lt.id=(select leave_type from employee_leaves el where el.employee_id=e.id and
el.active_flag=1 and date('2019-02-27') between el.from_date and el.to_date order by id desc limit 1)
left join attendance_logs al on al.attendance_id = att.id and al.attendance_flag = 1
left join attendance_approvals aa1 on al.id = aa1.attendance_log_id and aa1.action = 1 and aa1.active_flag = 1
left join attendance_approvals aa2 on al.id = aa2.attendance_log_id and aa1.action = 2 and aa2.active_flag = 1
inner join branches b on b.id = e.branch_id left join employees man on man.id = e.manager_id
left join employees man_behalf on man_behalf.id = att.on_behalf_attendance
left join employee_weekoff ewo on e.id = ewo.emp_id and date_part('dow','2019-02-27'::TIMESTAMP)+1 = ewo.weekoff_id and
ewo.active_flag =1 left join employee_holidays_view ehv on e.id = ehv.id and ehv.holiday_date = '2019-02-27'
left join company_employee_holidays_view cehv on e.id = cehv.id and ehv.holiday_date = '2019-02-27'
inner join roles rl on rl.id = e.role_id inner join cities cty on cty.id = b.city_id
inner join states on states.id = b.state_id inner join state_master sm on sm.id = states.state_id
inner join countries on countries.id = b.country_id inner join country_master cm on cm.country_id = countries.country_id
left join shifts sh on sh.id = att.shift_id left join attendance_reasons ch_in on ch_in.id = aa1.reason_id
left join sessions se on sh.id=se.shift_id left join attendance_reasons ch_out on ch_out.id = aa2.reason_id
left join attendance_reasons qc_ch_in on qc_ch_in.id = att.check_in_qc_review
left join attendance_reasons qc_ch_out on qc_ch_out.id = att.check_out_qc_review
left join attendance_reasons check_in on check_in.id = al.reason_id
left join time_zones tz on b.timezone = tz.time_zone inner join geo_outlet_mapping gom
on b.id = gom.outlet_id left join geo_structure_values gsv1 on gsv1.id = gom.level1 left join
geo_structure_values gsv2 on gsv2.id = gom.level2 left join geo_structure_values gsv3 on gsv3.id = gom.level3
where e.customer_id=200
group by concat(e.first_name::text, e.last_name::text) ,e.emp_id ,e.user_id ,e.client_emp_id , e.contact_no , e.email_id,e.profile_picture,(select string_agg(role_group_name, ', ') from role_group where role_group_id = any((select array_agg(role_group_id) from user_role_groups where user_id = u.id and active_flag = 1)::int[])),concat(man.first_name::text, man.last_name::text), rl.role_name,b.branch_name,b.branch_code,cty.city_name,sm.state,cm.country,gsv1.name,gsv2.name,case when @ass.reference_point = 1 THEN b.latitude else e.latitude END,case when @ass.reference_point = 1 THEN b.longitude else e.longitude END,sh.shift_name,sh.start_time, sh.end_time,((to_timestamp(EXTRACT(EPOCH FROM al.check_in_time::TIME) + ((tz.operator||''||tz.difference)::INTEGER))::TIME AT TIME ZONE 'utc')::TIME),case
when current_date='2019-02-27' and sh.end_time<cast(current_time as time without time zone) then null else (case when se.check_out_flag=1 then cast(att.total_hours as interval) when se.check_out_flag=0 then sh.end_time-((to_timestamp(EXTRACT(EPOCH FROM al.check_in_time::TIME) + ((tz.operator||''||tz.difference)::INTEGER))::TIME AT TIME ZONE 'utc')::TIME) end) end,al.check_in_lat, al.check_in_long,(select string_agg(value, ', ') from json_each_text(al.check_in_address::json)),att.check_in_late,al.check_in_distance_variation,al.check_in_selfie,case when @att.regularize_flag = 1 or @att.approval_flag = 1 THEN 'Approved' when @aa1.approval_flag = 1 THEN 'Approved' when @aa1.approval_flag = 2 THEN 'Rejected' when @aa1.approval_flag = 0 THEN 'Pending' else null END,case when @aa1.approval_flag = 2 then ch_in.attendance_reason end,qc_ch_in.attendance_reason,case when @att.regularize_flag = 1 or @att.approval_flag = 1 THEN 'Approved' when @aa2.approval_flag = 1 THEN 'Approved' when @aa2.approval_flag = 2 THEN 'Rejected' when @aa2.approval_flag = 0 THEN 'Pending' else null END,
case when att.attendance_type='P' and @att.approval_flag = 1 or att.attendance_type='L' and el.approval_flag=1 or
att.attendance_type='H' and eh.approval_flag=1 or att.attendance_type='M' and em.approval_flag=1 or
att.attendance_type='W' and ew.approval_flag=1 then 'Approved'
when att.attendance_type='P' and @att.approval_flag = 0 or att.attendance_type='L' and el.approval_flag=0 or
att.attendance_type='H' and eh.approval_flag=0 or att.attendance_type='M' and em.approval_flag=0 or
att.attendance_type='W' and ew.approval_flag=0 then 'Waiting for Approval'
when att.attendance_type='P' and el.approval_flag=2 or att.attendance_type='H' and eh.approval_flag=2 or
att.attendance_type='M' and em.approval_flag=2 or att.attendance_type='W' and ew.approval_flag=2 then 'Rejected'
when att.attendance_type='P' and @att.approval_flag is null then '' else 'Waiting for Approval' end,
case when att.attendance_type='P' then 'Marked' when att.attendance_type='L' then 'Marked'
when att.attendance_type='HL' or att.attendance_type='HP' then 'Marked' when att.attendance_type='W'
or ewo.weekoff_id is not null then 'Marked' when (e.customer_id is null and cehv.id is not null) or (e.customer_id is not null and ehv.id is not null) then 'Holiday' when el.employee_id is not null then 'Marked' when eh.employee_id is not null then 'Marked' when ew.employee_id is not null then 'Marked' when em.employee_id is not null then 'Marked' else 'Not Marked' end,case when att.attendance_type='P' then check_in.attendance_reason when att.attendance_type='L' then lt.leave_type_name when el.employee_id is not null then lt.leave_type_name when att.attendance_type='HL' or att.attendance_type='HP' then 'Half Day' when att.attendance_type='W' or ewo.weekoff_id is not null then 'Week off' when (e.customer_id is null and cehv.id is not null) or (e.customer_id is not null and ehv.id is not null) then 'Holiday' when eh.employee_id is not null then 'Holiday' when ew.employee_id is not null then 'Week off' when em.employee_id is not null then 'Marketoff' else 'Absent' end ,case when att.on_behalf_attendance is not null then concat(man_behalf.first_name::text,man_behalf.last_name::text) else null end
,att.id,e.customer_id,al.id,aa1.id,att.imei,man.id,att.Uniform,att.Samsung_Logo,att.Blue_Color_Check,att.Blue_Color_Percentage,
al.Face_Detection_Flag,att.Check_Out_Qc_Review,att.Check_Out_Distance,sh.id,att.id,e.id;
postgresql apache-spark pyspark apache-spark-sql pyspark-sql
I would recommend two things: 1. Run a simple join between any two tables. 2. Run the same join in PySpark, after loading each table into a separate RDD and using the join() function. Then check whether there is a difference in execution time.
– Jim Todd
Mar 7 at 17:50
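The comparison suggested in this comment could be sketched roughly as follows. This is only an illustration under stated assumptions: the helper names `timed`, `join_in_postgres` and `join_in_spark` are hypothetical, the two tables (`employees` and `branches`, joined on `branch_id`) are simply one pair taken from the query in the question, and `session` stands for any object exposing `.read.jdbc`, such as the question's `sqlContext`. Tables are loaded as DataFrames here because that is what `read.jdbc` returns.

```python
import time


def timed(fn, *args, **kwargs):
    """Call fn once and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


def join_in_postgres(session, url, properties):
    # The parenthesised subquery is executed by Postgres;
    # Spark only reads the rows that come back.
    query = (
        "(select e.id from employees e "
        "inner join branches b on b.id = e.branch_id) AS t"
    )
    df = session.read.jdbc(url=url, table=query, properties=properties)
    return df.count()


def join_in_spark(session, url, properties):
    # Each table is loaded as its own DataFrame, so the join
    # itself runs inside Spark rather than inside Postgres.
    employees = session.read.jdbc(url=url, table="employees", properties=properties)
    branches = session.read.jdbc(url=url, table="branches", properties=properties)
    joined = employees.join(branches, employees["branch_id"] == branches["id"], "inner")
    return joined.count()
```

With a live connection, something like `rows, seconds = timed(join_in_spark, sqlContext, url, properties)` and the same call with `join_in_postgres` would give the two timings to compare. `count()` is used only to force the read to execute, since Spark evaluates DataFrames lazily.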
left join shifts sh on sh.id = att.shift_id left join attendance_reasons ch_in on ch_in.id = aa1.reason_id
left join sessions se on sh.id=se.shift_id left join attendance_reasons ch_out on ch_out.id = aa2.reason_id
left join attendance_reasons qc_ch_in on qc_ch_in.id = att.check_in_qc_review
left join attendance_reasons qc_ch_out on qc_ch_out.id = att.check_out_qc_review
left join attendance_reasons check_in on check_in.id = al.reason_id
left join time_zones tz on b.timezone = tz.time_zone inner join geo_outlet_mapping gom
on b.id = gom.outlet_id left join geo_structure_values gsv1 on gsv1.id = gom.level1 left join
geo_structure_values gsv2 on gsv2.id = gom.level2 left join geo_structure_values gsv3 on gsv3.id = gom.level3
where e.customer_id=200
group by concat(e.first_name::text, e.last_name::text) ,e.emp_id ,e.user_id ,e.client_emp_id , e.contact_no , e.email_id,e.profile_picture,(select string_agg(role_group_name, ', ') from role_group where role_group_id = any((select array_agg(role_group_id) from user_role_groups where user_id = u.id and active_flag = 1)::int[])),concat(man.first_name::text, man.last_name::text), rl.role_name,b.branch_name,b.branch_code,cty.city_name,sm.state,cm.country,gsv1.name,gsv2.name,case when @ass.reference_point = 1 THEN b.latitude else e.latitude END,case when @ass.reference_point = 1 THEN b.longitude else e.longitude END,sh.shift_name,sh.start_time, sh.end_time,((to_timestamp(EXTRACT(EPOCH FROM al.check_in_time::TIME) + ((tz.operator||''||tz.difference)::INTEGER))::TIME AT TIME ZONE 'utc')::TIME),case
when current_date='2019-02-27' and sh.end_time<cast(current_time as time without time zone) then null else (case when se.check_out_flag=1 then cast(att.total_hours as interval) when se.check_out_flag=0 then sh.end_time-((to_timestamp(EXTRACT(EPOCH FROM al.check_in_time::TIME) + ((tz.operator||''||tz.difference)::INTEGER))::TIME AT TIME ZONE 'utc')::TIME) end) end,al.check_in_lat, al.check_in_long,(select string_agg(value, ', ') from json_each_text(al.check_in_address::json)),att.check_in_late,al.check_in_distance_variation,al.check_in_selfie,case when @att.regularize_flag = 1 or @att.approval_flag = 1 THEN 'Approved' when @aa1.approval_flag = 1 THEN 'Approved' when @aa1.approval_flag = 2 THEN 'Rejected' when @aa1.approval_flag = 0 THEN 'Pending' else null END,case when @aa1.approval_flag = 2 then ch_in.attendance_reason end,qc_ch_in.attendance_reason,case when @att.regularize_flag = 1 or @att.approval_flag = 1 THEN 'Approved' when @aa2.approval_flag = 1 THEN 'Approved' when @aa2.approval_flag = 2 THEN 'Rejected' when @aa2.approval_flag = 0 THEN 'Pending' else null END,
case when att.attendance_type='P' and @att.approval_flag = 1 or att.attendance_type='L' and el.approval_flag=1 or
att.attendance_type='H' and eh.approval_flag=1 or att.attendance_type='M' and em.approval_flag=1 or
att.attendance_type='W' and ew.approval_flag=1 then 'Approved'
when att.attendance_type='P' and @att.approval_flag = 0 or att.attendance_type='L' and el.approval_flag=0 or
att.attendance_type='H' and eh.approval_flag=0 or att.attendance_type='M' and em.approval_flag=0 or
att.attendance_type='W' and ew.approval_flag=0 then 'Waiting for Approval'
when att.attendance_type='P' and el.approval_flag=2 or att.attendance_type='H' and eh.approval_flag=2 or
att.attendance_type='M' and em.approval_flag=2 or att.attendance_type='W' and ew.approval_flag=2 then 'Rejected'
when att.attendance_type='P' and @att.approval_flag is null then '' else 'Waiting for Approval' end,
case when att.attendance_type='P' then 'Marked' when att.attendance_type='L' then 'Marked'
when att.attendance_type='HL' or att.attendance_type='HP' then 'Marked' when att.attendance_type='W'
or ewo.weekoff_id is not null then 'Marked' when (e.customer_id is null and cehv.id is not null) or (e.customer_id is not null and ehv.id is not null) then 'Holiday' when el.employee_id is not null then 'Marked' when eh.employee_id is not null then 'Marked' when ew.employee_id is not null then 'Marked' when em.employee_id is not null then 'Marked' else 'Not Marked' end,case when att.attendance_type='P' then check_in.attendance_reason when att.attendance_type='L' then lt.leave_type_name when el.employee_id is not null then lt.leave_type_name when att.attendance_type='HL' or att.attendance_type='HP' then 'Half Day' when att.attendance_type='W' or ewo.weekoff_id is not null then 'Week off' when (e.customer_id is null and cehv.id is not null) or (e.customer_id is not null and ehv.id is not null) then 'Holiday' when eh.employee_id is not null then 'Holiday' when ew.employee_id is not null then 'Week off' when em.employee_id is not null then 'Marketoff' else 'Absent' end ,case when att.on_behalf_attendance is not null then concat(man_behalf.first_name::text,man_behalf.last_name::text) else null end
,att.id,e.customer_id,al.id,aa1.id,att.imei,man.id,att.Uniform,att.Samsung_Logo,att.Blue_Color_Check,att.Blue_Color_Percentage,
al.Face_Detection_Flag,att.Check_Out_Qc_Review,att.Check_Out_Distance,sh.id,att.id,e.id;
postgresql apache-spark pyspark apache-spark-sql pyspark-sql
How can I reduce my query's execution time using PySpark?
I am using a Postgres database, and Spark is installed on my local machine, which has 10 GB of RAM.
Query execution time in pgAdmin: 10 sec
Query execution time in PySpark: 10 sec
My PySpark code is below:
    from pyspark.sql import DataFrameReader

    url = "jdbc:postgresql://168.23.233.4:5432/MyDatabase"
    # Connection properties must be a dict (the braces were missing)
    properties = {
        "driver": "org.postgresql.Driver",
        "user": "postgres",
        "password": "123",
    }
    # sqlContext is the SQLContext that the pyspark shell provides
    df = sqlContext.read.jdbc(url=url, table="(select.. very big query limit 10) AS t", properties=properties)
    df.show()
The query has to join more than 13 tables, and each table has 1 million rows.
Please help me make this query faster using Spark.
I tried this approach based on a blog post.
The query that runs inside the PySpark code is the one reproduced in full above.
asked Mar 7 at 4:45 by Srinivasan E (edited Mar 7 at 12:06)
I would recommend two things. 1. Run a simple join between any two tables. 2. Run the same join in PySpark, after loading each table into its own DataFrame and using the join() function. Then check whether there is a difference in execution time.
– Jim Todd
Mar 7 at 17:50
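The comparison suggested in this comment could be sketched roughly as follows. This is only an illustration under assumptions: `time_call` and `spark_join_count` are hypothetical helper names, the table names and join key in the usage note are taken from the query in the question, and `spark` is assumed to be an existing SparkSession with the Postgres JDBC driver on its classpath.

```python
import time


def time_call(label, fn, repeats=3):
    """Run fn() `repeats` times and return (label, best wall-clock seconds)."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return label, best


def spark_join_count(spark, url, props, left, right, key):
    """Load two tables over JDBC, join them in Spark, and force execution.

    `key` must be a column name that exists in both tables.
    count() is an action, so the join actually runs instead of staying lazy.
    """
    left_df = spark.read.jdbc(url=url, table=left, properties=props)
    right_df = spark.read.jdbc(url=url, table=right, properties=props)
    return left_df.join(right_df, on=key).count()
```

A possible usage, assuming `url` and `properties` are defined as in the question: `time_call("spark join", lambda: spark_join_count(spark, url, properties, "attendances", "employee_leaves", "employee_id"))`. The same join can then be timed as plain SQL in pgAdmin to see which side is faster for this data.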
1 Answer
As far as I know, in this case the query will take the same time in PySpark and in pgAdmin, because both run the query on the Postgres database itself.
At this point you have not yet used Spark's distributed computing and storage. You have only created a DataFrame from the output of a SQL query that Postgres executed. Only the operations you perform on that DataFrame afterwards can show a difference in speed.
So the optimization has to happen on the Postgres side. The following points will help:
- Optimize your SQL so that it runs faster.
- Read chunks of the tables (simple SQL statements) into Spark, and consider doing the actions/transformations in PySpark to get the desired result, instead of running one SQL statement with complex joins.
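A minimal sketch of the second point, assuming a numeric `id` column to split on: the helper below only builds the option set for a partitioned JDBC read, so that Spark opens several connections and reads the table in ranges. The function names, the bounds, and the partition count are illustrative assumptions, not values from the question; the option keys (`partitionColumn`, `lowerBound`, `upperBound`, `numPartitions`, `fetchsize`) are the ones documented for Spark's JDBC data source.

```python
def partitioned_jdbc_options(url, table, column, lower, upper, num_partitions,
                             user, password, fetchsize=10000):
    """Build options for a JDBC read that Spark splits into parallel ranges.

    lower/upper only decide how the ranges are cut; they do not filter rows.
    """
    if num_partitions < 1:
        raise ValueError("num_partitions must be at least 1")
    if lower >= upper:
        raise ValueError("lower must be smaller than upper")
    return {
        "url": url,
        "dbtable": table,
        "user": user,
        "password": password,
        "driver": "org.postgresql.Driver",
        "partitionColumn": column,
        "lowerBound": str(lower),
        "upperBound": str(upper),
        "numPartitions": str(num_partitions),
        "fetchsize": str(fetchsize),
    }


def read_table(spark, options):
    """Load one table as a DataFrame; `spark` is an existing SparkSession."""
    return spark.read.format("jdbc").options(**options).load()


# Hypothetical usage (needs a running Spark session and the Postgres driver):
# emp = read_table(spark, partitioned_jdbc_options(
#     url, "employees", "id", 1, 1000000, 8, "postgres", "123"))
# att = read_table(spark, partitioned_jdbc_options(
#     url, "attendances", "id", 1, 1000000, 8, "postgres", "123"))
# joined = emp.join(att, emp["id"] == att["employee_id"], "left")
```

Whether this beats the single SQL statement depends on the data and the hardware. On one local machine the parallelism is limited to the local cores, and Postgres still has to serve every row over the network.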
This is just hand-waving. Unless the query and its execution plan are known, nobody can say what the problem is. What I object to is the assumption that doing the work in the application will be faster than doing it in SQL. While there are certainly cases where this is true, my personal experience says that it is usually the other way around.
– Laurenz Albe
Mar 7 at 7:02
A complex SQL query on a relational database is comparatively slower than a PySpark operation that uses distributed computing (multiple nodes). It is like one person counting the pages of a book versus 10 people counting equal chunks of the book and aggregating the result. Isn't that faster? That is how I arrived at the second point in my answer.
– Jim Todd
Mar 7 at 9:35
You are potentially right, if the workload can conveniently be parallelized in the application in a way that PostgreSQL cannot do better with its internal parallel query. A big "if", since we know nothing about the query in question.
– Laurenz Albe
Mar 7 at 11:00
The "if" could be trivial here, because people mainly use Spark to handle huge data and reduce query/operation time. The underlying principle is distributed storage and computation: "RDD" stands for Resilient Distributed Dataset. So it is the parallelism in Spark that would make the difference.
– Jim Todd
Mar 7 at 14:27
Point taken. Let's see if and what OP answers.
– Laurenz Albe
Mar 7 at 15:37
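The execution plan that the first comment asks for can be obtained from Postgres with `EXPLAIN`. The following is a small sketch, not something from the thread: `explain_sql` and `fetch_plan` are hypothetical helpers, and `cursor` is assumed to be any Python DB-API cursor connected to the Postgres database (for example one created with psycopg2).

```python
def explain_sql(query, analyze=True):
    """Wrap a SQL statement in EXPLAIN.

    With analyze=True, Postgres really executes the statement and reports
    actual timings and buffer usage, so use it with care on slow queries.
    """
    prefix = "EXPLAIN (ANALYZE, BUFFERS) " if analyze else "EXPLAIN "
    return prefix + query.strip().rstrip(";")


def fetch_plan(cursor, query, analyze=True):
    """Return the plan as one string; each result row holds one plan line."""
    cursor.execute(explain_sql(query, analyze=analyze))
    return "\n".join(row[0] for row in cursor.fetchall())
```

Running `fetch_plan(cursor, big_query)` and looking for sequential scans on the large tables or for badly misestimated row counts is usually a more direct way to find the slow part than moving the query into Spark.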
answered Mar 7 at 5:18 by Jim Todd
1
This is just hand-waving. Unless the query and its execution plan are known, nobody can say what the problem is. What I object to is the assumption that doing the work in the application will be faster than doing it in SQL. While there are certainly cases where this is true, my personal experience says that it is usually the other way around.
– Laurenz Albe
Mar 7 at 7:02
A (complex)SQL on top of a relational database is comparatively slower to a pyspark operation that utilizes distributed computing technique(multiple nodes). Just like, counting the number of pages by a single person vs counting equal chunks of the book by 10 people and aggregating it, Isnt it faster? Thats how I articulated the 2nd point in my answer.
– Jim Todd
Mar 7 at 9:35
1
You are potentially right, if the workload can conveniently parallelized in the application in a way that PostgreSQL cannot do better with its internal parallel query. A big "if", since we know nothing about the query in question.
– Laurenz Albe
Mar 7 at 11:00
The "if" could be trivial here. Because, people use Spark majorly to handle huge data and lessen its query/operational time. Principle underlying this is obviously a distributed storage and computation. "RDD" is Resilient Distributed Data. So, it would be a advantage of parallelism in spark that makes a difference.
– Jim Todd
Mar 7 at 14:27
Point taken. Let's see if and what OP answers.
– Laurenz Albe
Mar 7 at 15:37
I would recommend doing two things: 1. Run a simple join between any two tables in Postgres. 2. Do the same join in pyspark, after loading each table into a separate RDD and using the join() function. Then check whether there is a difference in execution time.
– Jim Todd
Mar 7 at 17:50
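A minimal sketch of how the comparison suggested above could be timed. The helper names, `jdbc_opts` (a dict holding at least `url` plus credentials), the table names and the join key are all hypothetical, and `spark` is assumed to be an existing SparkSession. Because Spark evaluates lazily, each variant ends with an action (`count()`); otherwise only the construction of the plan would be timed.

```python
import time


def timed(fn, *args, **kwargs):
    """Call fn and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


def _load(spark, jdbc_opts, table):
    # A fresh reader per table, so options are not shared between reads.
    opts = dict(jdbc_opts, dbtable=table)
    return spark.read.format("jdbc").options(**opts).load()


def count_join_in_postgres(spark, jdbc_opts, join_sql):
    """Variant 1: Postgres executes the join; Spark only fetches the rows."""
    return _load(spark, jdbc_opts, f"({join_sql}) AS pushed_down").count()


def count_join_in_spark(spark, jdbc_opts, left_table, right_table, key):
    """Variant 2: load both tables separately and join them in Spark."""
    left = _load(spark, jdbc_opts, left_table)
    right = _load(spark, jdbc_opts, right_table)
    return left.join(right, on=key).count()


# Hypothetical usage with tables `a` and `b` that share a column `id`:
# rows1, secs1 = timed(count_join_in_postgres, spark, jdbc_opts,
#                      "SELECT * FROM a JOIN b USING (id)")
# rows2, secs2 = timed(count_join_in_spark, spark, jdbc_opts, "a", "b", "id")
```

Running each variant more than once helps separate warm-up and caching effects from the actual difference.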