Can't parse some content using a customized method
I've written a script using scrapy to get the name, phone number, and email from a website. The content I'm after is spread across two different links: the name and phone are on one page, while the email is on another. I've used yellowpages.com as an example and tried to implement the logic in such a way that I can yield the email together with the other fields while parsing the landing page. It's a requirement that I can't use meta. I did manage to accomplish the job under that constraint by using requests and BeautifulSoup in combination with scrapy, but it is really slow.

Working version (using requests and BeautifulSoup):
import scrapy
import requests
from bs4 import BeautifulSoup
from scrapy.crawler import CrawlerProcess

def get_email(target_link):
    # Blocking fetch of the detail page with requests, parsed with BeautifulSoup
    res = requests.get(target_link)
    soup = BeautifulSoup(res.text, "lxml")
    email = soup.select_one("a.email-business[href^='mailto:']")
    if email:
        return email.get("href")
    else:
        return None

class YellowpagesSpider(scrapy.Spider):
    name = "yellowpages"
    start_urls = ["https://www.yellowpages.com/search?search_terms=Coffee+Shops&geo_location_terms=San+Francisco%2C+CA"]

    def parse(self, response):
        for items in response.css("div.v-card .info"):
            name = items.css("a.business-name > span::text").get()
            phone = items.css("div.phones::text").get()
            email = get_email(response.urljoin(items.css("a.business-name::attr(href)").get()))
            yield {"Name": name, "Phone": phone, "Email": email}

if __name__ == "__main__":
    c = CrawlerProcess({
        'USER_AGENT': 'Mozilla/5.0',
    })
    c.crawl(YellowpagesSpider)
    c.start()
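The slowness is most likely because requests.get() is a blocking call: every detail page is fetched synchronously inside the parse callback, so Scrapy's asynchronous downloader sits idle while the emails are looked up one at a time.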
I'm trying to mimic the above concept without requests and BeautifulSoup, but I can't make it work:
import scrapy
from scrapy.crawler import CrawlerProcess

class YellowpagesSpider(scrapy.Spider):
    name = "yellowpages"
    start_urls = ["https://www.yellowpages.com/search?search_terms=Coffee+Shops&geo_location_terms=San+Francisco%2C+CA"]

    def parse(self, response):
        for items in response.css("div.v-card .info"):
            name = items.css("a.business-name > span::text").get()
            phone = items.css("div.phones::text").get()
            email_link = response.urljoin(items.css("a.business-name::attr(href)").get())
            # CAN'T APPLY THE LOGIC IN THE FOLLOWING LINE
            email = self.get_email(email_link)
            yield {"Name": name, "Phone": phone, "Email": email}

    def get_email(self, link):
        # `response` is not defined here; a plain method has no way to fetch `link`
        email = response.css("a.email-business[href^='mailto:']::attr(href)").get()
        return email

if __name__ == "__main__":
    c = CrawlerProcess({
        'USER_AGENT': 'Mozilla/5.0',
    })
    c.crawl(YellowpagesSpider)
    c.start()
How can I make my second script work so that it mimics what the first script does?
python python-3.x web-scraping scrapy
asked Mar 8 at 22:37 by robots.txt
Why can't you use request.meta? It's the appropriate tool for the job.
– stranac Mar 9 at 8:47
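For reference, the request.meta approach the comment suggests would typically look something like the sketch below. It reuses the selectors from the question; the parse_email callback name is just illustrative, and this is exactly the pattern the question rules out.

import scrapy

class YellowpagesSpider(scrapy.Spider):
    name = "yellowpages"
    start_urls = ["https://www.yellowpages.com/search?search_terms=Coffee+Shops&geo_location_terms=San+Francisco%2C+CA"]

    def parse(self, response):
        for items in response.css("div.v-card .info"):
            # Build the partial item from the listing page
            item = {
                "Name": items.css("a.business-name > span::text").get(),
                "Phone": items.css("div.phones::text").get(),
            }
            detail_url = response.urljoin(items.css("a.business-name::attr(href)").get())
            # Carry the partial item to the detail-page callback via meta
            yield scrapy.Request(detail_url, callback=self.parse_email, meta={"item": item})

    def parse_email(self, response):
        item = response.meta["item"]
        item["Email"] = response.css("a.email-business[href^='mailto:']::attr(href)").get()
        yield item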
1 Answer
I'd use response.meta, but if you're required to avoid it, OK, let's try another way: check out the library https://pypi.org/project/scrapy-inline-requests/
import scrapy
from inline_requests import inline_requests

class YellowpagesSpider(scrapy.Spider):
    name = "yellowpages"
    start_urls = ["https://www.yellowpages.com/search?search_terms=Coffee+Shops&geo_location_terms=San+Francisco%2C+CA"]

    @inline_requests
    def parse(self, response):
        for items in response.css("div.v-card .info"):
            name = items.css("a.business-name > span::text").get()
            phone = items.css("div.phones::text").get()
            email_url = items.css("a.business-name::attr(href)").get()
            # Fire the detail-page request inline and wait for its response
            email_resp = yield scrapy.Request(response.urljoin(email_url), meta={'handle_httpstatus_all': True})
            email = email_resp.css("a.email-business[href^='mailto:']::attr(href)").get() if email_resp.status == 200 else None
            yield {"Name": name, "Phone": phone, "Email": email}
Perfect!!! Thanks @vezunchik for your awesome solution. Make sure to add yield before scrapy.Request(response.urljoin(email_url)).
– robots.txt Mar 10 at 16:57
Oops, you're totally right, that was a typo :)
– vezunchik Mar 10 at 18:42