Can't parse some content using a customized method
I've written a script using scrapy to get the name, phone number, and email from a website. The content I'm after is spread across two different links: the name and phone are on one page, while the email is on another. I've used yellowpages.com as an example and tried to implement the logic in such a way that I can yield the email together with the other fields while parsing the landing page. It's a requirement that I can't use meta. I did manage to accomplish the job under that constraint by using requests and BeautifulSoup in combination with scrapy, but it is really slow.

Working version (using requests and BeautifulSoup):
import scrapy
import requests
from bs4 import BeautifulSoup
from scrapy.crawler import CrawlerProcess

def get_email(target_link):
    # Blocking fetch of the detail page with requests, parsed with BeautifulSoup
    res = requests.get(target_link)
    soup = BeautifulSoup(res.text, "lxml")
    email = soup.select_one("a.email-business[href^='mailto:']")
    if email:
        return email.get("href")
    else:
        return None

class YellowpagesSpider(scrapy.Spider):
    name = "yellowpages"
    start_urls = ["https://www.yellowpages.com/search?search_terms=Coffee+Shops&geo_location_terms=San+Francisco%2C+CA"]

    def parse(self, response):
        for items in response.css("div.v-card .info"):
            name = items.css("a.business-name > span::text").get()
            phone = items.css("div.phones::text").get()
            email = get_email(response.urljoin(items.css("a.business-name::attr(href)").get()))
            yield {"Name": name, "Phone": phone, "Email": email}

if __name__ == "__main__":
    c = CrawlerProcess({
        'USER_AGENT': 'Mozilla/5.0',
    })
    c.crawl(YellowpagesSpider)
    c.start()
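The slowness is most likely because requests.get() is a blocking call: every detail page is fetched synchronously inside the parse callback, so Scrapy's asynchronous downloader sits idle while the emails are looked up one at a time.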
I'm trying to mimic the above concept without requests and BeautifulSoup, but I can't make it work:
import scrapy
from scrapy.crawler import CrawlerProcess

class YellowpagesSpider(scrapy.Spider):
    name = "yellowpages"
    start_urls = ["https://www.yellowpages.com/search?search_terms=Coffee+Shops&geo_location_terms=San+Francisco%2C+CA"]

    def parse(self, response):
        for items in response.css("div.v-card .info"):
            name = items.css("a.business-name > span::text").get()
            phone = items.css("div.phones::text").get()
            email_link = response.urljoin(items.css("a.business-name::attr(href)").get())
            # CAN'T APPLY THE LOGIC IN THE FOLLOWING LINE
            email = self.get_email(email_link)
            yield {"Name": name, "Phone": phone, "Email": email}

    def get_email(self, link):
        # `response` is not defined here; a plain method has no way to fetch `link`
        email = response.css("a.email-business[href^='mailto:']::attr(href)").get()
        return email

if __name__ == "__main__":
    c = CrawlerProcess({
        'USER_AGENT': 'Mozilla/5.0',
    })
    c.crawl(YellowpagesSpider)
    c.start()
How can I make my second script work so that it mimics what the first script does?
python python-3.x web-scraping scrapy
asked Mar 8 at 22:37 by robots.txt
Why can't you use request.meta? It's the appropriate tool for the job.
– stranac Mar 9 at 8:47
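For reference, the request.meta approach the comment suggests would typically look something like the sketch below. It reuses the selectors from the question; the parse_email callback name is just illustrative, and this is exactly the pattern the question rules out.

import scrapy

class YellowpagesSpider(scrapy.Spider):
    name = "yellowpages"
    start_urls = ["https://www.yellowpages.com/search?search_terms=Coffee+Shops&geo_location_terms=San+Francisco%2C+CA"]

    def parse(self, response):
        for items in response.css("div.v-card .info"):
            # Build the partial item from the listing page
            item = {
                "Name": items.css("a.business-name > span::text").get(),
                "Phone": items.css("div.phones::text").get(),
            }
            detail_url = response.urljoin(items.css("a.business-name::attr(href)").get())
            # Carry the partial item to the detail-page callback via meta
            yield scrapy.Request(detail_url, callback=self.parse_email, meta={"item": item})

    def parse_email(self, response):
        item = response.meta["item"]
        item["Email"] = response.css("a.email-business[href^='mailto:']::attr(href)").get()
        yield item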
1 Answer
I'd use response.meta, but if you're required to avoid it, OK, let's try another way: check out the library https://pypi.org/project/scrapy-inline-requests/
import scrapy
from inline_requests import inline_requests

class YellowpagesSpider(scrapy.Spider):
    name = "yellowpages"
    start_urls = ["https://www.yellowpages.com/search?search_terms=Coffee+Shops&geo_location_terms=San+Francisco%2C+CA"]

    @inline_requests
    def parse(self, response):
        for items in response.css("div.v-card .info"):
            name = items.css("a.business-name > span::text").get()
            phone = items.css("div.phones::text").get()
            email_url = items.css("a.business-name::attr(href)").get()
            # Fire the detail-page request inline and wait for its response
            email_resp = yield scrapy.Request(response.urljoin(email_url), meta={'handle_httpstatus_all': True})
            email = email_resp.css("a.email-business[href^='mailto:']::attr(href)").get() if email_resp.status == 200 else None
            yield {"Name": name, "Phone": phone, "Email": email}
Perfect!!! Thanks @vezunchik for your awesome solution. Make sure to add yield before scrapy.Request(response.urljoin(email_url)).
– robots.txt Mar 10 at 16:57
Oops, you're totally right, that was a typo :)
– vezunchik Mar 10 at 18:42