How to use threading with selenium for web scraping?

My main objective is to scrape as many profile links as possible on Khan Academy, and then to scrape some specific data from each of these profiles.



My goal with this question is to use threading to make my script work much faster.



So I will present my code in two parts: the first part without threading and the second part with threading.



This is the original code without threading:



from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException, StaleElementReferenceException
from bs4 import BeautifulSoup
import re
from requests_html import HTMLSession

session = HTMLSession()
r = session.get('https://www.khanacademy.org/computing/computer-programming/programming#intro-to-programming')
r.html.render(sleep=5)
soup = BeautifulSoup(r.html.html, 'html.parser')

# find course steps links
courses_links = soup.find_all(class_='link_1uvuyao-o_O-nodeStyle_cu2reh-o_O-nodeStyleIcon_4udnki')
list_courses = {}

for links in courses_links:
    courses = links.extract()
    link_course = courses['href']
    title_course = links.find(class_='nodeTitle_145jbuf')
    span_title_course = title_course.span
    text_span = span_title_course.text.strip()
    final_link_course = 'https://www.khanacademy.org' + link_course
    list_courses[text_span] = final_link_course
#print(list_courses)

# my goal is to loop the below script with each "course link" that I got above with list_courses
for courses_step in list_courses.values():
    driver = webdriver.Chrome()
    driver.get(courses_step)
    while True:
        try:
            showmore = WebDriverWait(driver, 15).until(EC.presence_of_element_located((By.CLASS_NAME, 'button_1eqj1ga-o_O-shared_1t8r4tr-o_O-default_9fm203')))
            showmore.click()
        except TimeoutException:
            break
        except StaleElementReferenceException:
            break

    soup = BeautifulSoup(driver.page_source, 'html.parser')
    # find the profile links
    profiles = soup.find_all(href=re.compile("/profile/kaid"))
    profile_list = []
    for links in profiles:
        links_no_list = links.extract()
        text_link = links_no_list['href']
        text_link_nodiscussion = text_link[:-10]
        final_profile_link = 'https://www.khanacademy.org' + text_link_nodiscussion
        profile_list.append(final_profile_link)

    # remove duplicates, to avoid scraping the same profile multiple times
    profile_list = list(set(profile_list))

    # print the number of profiles we got for this link
    print('in this link:')
    print(courses_step)
    print('we have this number of profiles:')
    print(len(profile_list))

    # create the csv file
    filename = "khanscraptry1.csv"
    f = open(filename, "w")
    headers = "link, date_joined, points, videos, questions, votes, answers, flags, project_request, project_replies, comments, tips_thx, last_date\n"
    f.write(headers)

    # for each profile link, scrape the specific data and store it in the csv
    for link in profile_list:
        # print each profile link we are about to scrape
        print("Scraping ", link)
        driver.get(link)
        # wait for content to load; if the profile does not exist, skip it
        try:
            WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.XPATH, '//*[@id="widget-list"]/div[1]/div[1]')))
        except TimeoutException:
            continue
        soup = BeautifulSoup(driver.page_source, 'html.parser')
        user_info_table = soup.find('table', class_='user-statistics-table')
        if user_info_table is not None:
            dates, points, videos = [tr.find_all('td')[1].text for tr in user_info_table.find_all('tr')]
        else:
            dates = points = videos = 'NA'

        user_socio_table = soup.find_all('div', class_='discussion-stat')
        data = {}
        for gettext in user_socio_table:
            category = gettext.find('span')
            category_text = category.text.strip()
            number = category.previousSibling.strip()
            data[category_text] = number

        full_data_keys = ['questions', 'votes', 'answers', 'flags raised', 'project help requests', 'project help replies', 'comments', 'tips and thanks']  # might change 'answers' to 'answer' because when the count is 1 it's putting NA instead
        for header_value in full_data_keys:
            if header_value not in data.keys():
                data[header_value] = 'NA'

        user_calendar = soup.find('div', class_='streak-calendar-scroll-container')
        if user_calendar is not None:
            last_activity = user_calendar.find('span', class_='streak-cell filled')
            try:
                last_activity_date = last_activity['title']
            except TypeError:
                last_activity_date = 'NA'
        else:
            last_activity_date = 'NA'
        f.write(link + "," + dates + "," + points.replace(",", "") + "," + videos + "," + data['questions'] + "," + data['votes'] + "," + data['answers'] + "," + data['flags raised'] + "," + data['project help requests'] + "," + data['project help replies'] + "," + data['comments'] + "," + data['tips and thanks'] + "," + last_activity_date + "\n")


This code works fine, but the problem is that it takes way too much time.



And here is the script that includes threading:



from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException, StaleElementReferenceException
from bs4 import BeautifulSoup
import re
from requests_html import HTMLSession
import concurrent.futures

session = HTMLSession()
r = session.get('https://www.khanacademy.org/computing/computer-programming/programming#intro-to-programming')
r.html.render(sleep=5)
soup = BeautifulSoup(r.html.html, 'html.parser')

# find course steps links
courses_links = soup.find_all(class_='link_1uvuyao-o_O-nodeStyle_cu2reh-o_O-nodeStyleIcon_4udnki')
list_courses = {}

for links in courses_links:
    courses = links.extract()
    link_course = courses['href']
    title_course = links.find(class_='nodeTitle_145jbuf')
    span_title_course = title_course.span
    text_span = span_title_course.text.strip()
    final_link_course = 'https://www.khanacademy.org' + link_course
    list_courses[text_span] = final_link_course

# that's my driver function
def showmore(url, timeout):
    driver = webdriver.Chrome()
    driver.get(url)
    while True:
        try:
            showmore_button = WebDriverWait(driver, timeout).until(EC.presence_of_element_located((By.CLASS_NAME, 'button_1eqj1ga-o_O-shared_1t8r4tr-o_O-default_9fm203')))
            showmore_button.click()
        except TimeoutException:
            break
        except StaleElementReferenceException:
            break

# that's my pool
with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
    # do this in my pool
    future_to_url = {executor.submit(showmore, url, 20): url for url in list_courses.values()}


As you can see, the second script is not doing everything yet. I still have to add the whole data scraping / writing process.



My question is: how do I create threads for the scraping and writing parts? How should I order these threads?



More broadly: how can I make my script run as fast as possible?
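To make the goal concrete, here is a minimal sketch of the structure I am aiming for. It is not working scraping code: `scrape_course` is a hypothetical placeholder for the Selenium logic above, and the URLs are made up. Each worker runs its slow scrape concurrently, and a lock serializes only the writes to the shared CSV file:

```python
import concurrent.futures
import csv
import threading

write_lock = threading.Lock()

def scrape_course(url):
    # hypothetical stand-in for the Selenium scraping logic above;
    # it just fabricates one csv row per url so the pattern is runnable
    return [[url, 'NA', 'NA']]

def worker(url, writer):
    rows = scrape_course(url)   # the slow part runs concurrently
    with write_lock:            # only the shared file write is serialized
        writer.writerows(rows)

with open('khanscraptry1.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['link', 'date_joined', 'points'])
    urls = ['https://example.org/a', 'https://example.org/b']
    with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
        futures = [executor.submit(worker, url, writer) for url in urls]
        for future in concurrent.futures.as_completed(futures):
            future.result()     # re-raise any exception from a worker
```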










  • The easy version is to have each thread write its own file and then, after the run is complete, stitch the files together.

    – JeffC
    Mar 7 at 1:09
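That suggestion can be sketched like this (the file names and rows here are hypothetical stand-ins for the real scraped data): each thread writes to its own CSV, so no lock is needed, and the parts are concatenated once all threads have joined.

```python
import csv
import glob
import threading

def worker(thread_id, rows):
    # each thread owns its own file, so no locking is needed
    with open(f'khan_part_{thread_id}.csv', 'w', newline='') as f:
        csv.writer(f).writerows(rows)

threads = [threading.Thread(target=worker, args=(i, [[f'link{i}', 'NA']]))
           for i in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# after the run is complete, stitch the per-thread files together
with open('khan_all.csv', 'w', newline='') as out:
    writer = csv.writer(out)
    writer.writerow(['link', 'date_joined'])
    for part in sorted(glob.glob('khan_part_*.csv')):
        with open(part, newline='') as f:
            writer.writerows(csv.reader(f))
```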











  • threading on webdriver may not be the right way stackoverflow.com/a/30829406/6770946. You may want to consider using grequests, which allows threading of requests instead of tabs in the browser.

    – Y Y
    2 days ago















python-3.x multithreading selenium web-scraping beautifulsoup






edited Mar 7 at 2:09 by Jane

asked Mar 6 at 23:09 by RobZ


1 Answer
To answer your "more broadly" question: you ought to use asyncio in conjunction with requests or a similar package. A decent guide for doing so can be found here. Threading is not built for running asynchronous HTTP requests.



I can't show you how to write your code with asyncio because I hardly know how to use it myself, and it would likely take hundreds of lines of code to finish.
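For reference, the core asyncio pattern is smaller than it sounds; a minimal sketch with a stubbed fetch (real code would use an async HTTP client such as aiohttp, and these URLs are placeholders):

```python
# asyncio runs all the "fetches" concurrently on a single thread.
import asyncio

async def fetch(url):
    await asyncio.sleep(0.01)        # simulates waiting on the network
    return f"data from {url}"

async def main():
    urls = [f"https://example.com/course/{i}" for i in range(5)]
    # gather schedules every coroutine and waits for them all
    return await asyncio.gather(*(fetch(u) for u in urls))

results = asyncio.run(main())
print(len(results))
```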



If you want a quick way to improve performance with the code you already have, set your Selenium browser to headless mode:



from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.headless = True  # no browser window; saves some rendering overhead
driver = webdriver.Chrome('YOUR_CHROMEDRIVER_PATH_HERE', options=options)






  • switching to headless won't make a large difference in performance

    – Corey Goldberg
    Mar 7 at 1:02











  • @CoreyGoldberg Yes unfortunately headless mode is not going to change much, do you have suggestions?

    – RobZ
    4 hours ago










answered Mar 7 at 0:29

Jane











