Can't parse some content using a customized method


I've written a script using scrapy to get the name, phone number, and email from a website. The content I'm after is spread across two different links: the name and phone are on one page, and the email is on another. I've used yellowpages.com here as an example and tried to implement the logic so that I can parse the email even while I'm on the landing page. It's a requirement that I can't use meta. I did manage to accomplish this by using requests and BeautifulSoup in combination with scrapy, but it is really slow.



The working version (scrapy combined with requests and BeautifulSoup):



import scrapy
import requests
from bs4 import BeautifulSoup
from scrapy.crawler import CrawlerProcess

def get_email(target_link):
    res = requests.get(target_link)
    soup = BeautifulSoup(res.text, "lxml")
    email = soup.select_one("a.email-business[href^='mailto:']")
    if email:
        return email.get("href")
    else:
        return None

class YellowpagesSpider(scrapy.Spider):
    name = "yellowpages"
    start_urls = ["https://www.yellowpages.com/search?search_terms=Coffee+Shops&geo_location_terms=San+Francisco%2C+CA"]

    def parse(self, response):
        for items in response.css("div.v-card .info"):
            name = items.css("a.business-name > span::text").get()
            phone = items.css("div.phones::text").get()
            email = get_email(response.urljoin(items.css("a.business-name::attr(href)").get()))
            yield {"Name": name, "Phone": phone, "Email": email}

if __name__ == "__main__":
    c = CrawlerProcess({
        'USER_AGENT': 'Mozilla/5.0',
    })
    c.crawl(YellowpagesSpider)
    c.start()


I'm trying to mimic the above concept without requests and BeautifulSoup, but I can't make it work.



import scrapy
from scrapy.crawler import CrawlerProcess

class YellowpagesSpider(scrapy.Spider):
    name = "yellowpages"
    start_urls = ["https://www.yellowpages.com/search?search_terms=Coffee+Shops&geo_location_terms=San+Francisco%2C+CA"]

    def parse(self, response):
        for items in response.css("div.v-card .info"):
            name = items.css("a.business-name > span::text").get()
            phone = items.css("div.phones::text").get()
            email_link = response.urljoin(items.css("a.business-name::attr(href)").get())

            # CAN'T APPLY THE LOGIC IN THE FOLLOWING LINE

            email = self.get_email(email_link)
            yield {"Name": name, "Phone": phone, "Email": email}

    def get_email(self, link):
        email = response.css("a.email-business[href^='mailto:']::attr(href)").get()
        return email

if __name__ == "__main__":
    c = CrawlerProcess({
        'USER_AGENT': 'Mozilla/5.0',
    })
    c.crawl(YellowpagesSpider)
    c.start()


How can I make my second script work mimicking the first script?










  • Why can't you use request.meta? It's the appropriate tool for the job.

    – stranac
    Mar 9 at 8:47















python python-3.x web-scraping scrapy






asked Mar 8 at 22:37









robots.txt

1 Answer
I'd use response.meta, but since you're required to avoid it, let's try another way: check out the library https://pypi.org/project/scrapy-inline-requests/



import scrapy
from inline_requests import inline_requests


class YellowpagesSpider(scrapy.Spider):
    name = "yellowpages"
    start_urls = ["https://www.yellowpages.com/search?search_terms=Coffee+Shops&geo_location_terms=San+Francisco%2C+CA"]

    @inline_requests
    def parse(self, response):
        for items in response.css("div.v-card .info"):
            name = items.css("a.business-name > span::text").get()
            phone = items.css("div.phones::text").get()

            email_url = items.css("a.business-name::attr(href)").get()
            email_resp = yield scrapy.Request(response.urljoin(email_url), meta={'handle_httpstatus_all': True})
            email = email_resp.css("a.email-business[href^='mailto:']::attr(href)").get() if email_resp.status == 200 else None
            yield {"Name": name, "Phone": phone, "Email": email}
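scrapy-inline-requests is a third-party package, so (assuming a typical pip-based setup) it needs to be installed before the decorator can be imported:

```shell
pip install scrapy-inline-requests
```

The @inline_requests decorator lets parse() yield a Request and receive its Response inline, which is what lets the email lookup stay in the same callback without meta.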




  • Perfect!!! Thanks @vezunchik for your awesome solution. Make sure to add yield before scrapy.Request(response.urljoin(email_url)).

    – robots.txt
    Mar 10 at 16:57











  • Oops, you're totally right, that was a typo :)

    – vezunchik
    Mar 10 at 18:42











edited Mar 10 at 18:41

























answered Mar 10 at 14:19









vezunchik
