Why is JSoup timing out at random places in my code?
I am currently trying to use JSoup in Java to scrape retrosheet.org for a baseball coding project I am working on.
I make multiple JSoup connections in my code, and some of them are made in a loop, so they are executed many times. In total, the program makes hundreds of connections to scrape the necessary data.
The program works for about 5 seconds but then hangs on a connection (a different one each time). When I then try to open the website separately in my browser, it will not load. What could be causing this? Is there a problem with making too many connections?
Here is an example of a connection I am making (all connections follow this same format):
doc = Jsoup.connect("https://www.retrosheet.org/boxesetc/index.html").maxBodySize(0).userAgent("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/12.0.3 Safari/605.1.15").get();
This is the error I am getting
java web-scraping connection timeout jsoup
asked Mar 8 at 23:33 by Jacob Snyder (edited Mar 8 at 23:43)
1 Answer
This is most definitely load protection on the target website's side: it detects too many requests from the same IP and either blocks that IP for a while or throttles the number of connections/requests from it. That is also why you can't open the website in your browser. It's not about JSoup or Java at all; connections/requests from your IP to the target website are being blocked or throttled.
Is there a way around this? Thank you for the answer.
– Jacob Snyder
Mar 9 at 0:00
Well, you could throttle your requests, e.g. by inserting delays in the code that makes them. You could also implement retries (optionally with a delay between retries as well). There might also be a problem with the number of connections you create: JSoup will probably not reuse connections, but Commons HTTPClient with a connection-pooling connection manager will. You could retrieve the HTML via Commons HTTPClient and then use JSoup for parsing only (not using its HTTP client capabilities). Best of all, do all of this (delays + retries + Commons HTTPClient for retrieval).
– mvmn
Mar 9 at 0:04
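The delay-plus-retry suggestion in the comment above could be sketched roughly as follows. This is only an illustration under stated assumptions: the class name ThrottledRetry, its two parameters, and the doubling back-off are hypothetical choices, not something from JSoup or from the answer, and suitable delays depend on what the target site tolerates.

```java
import java.util.concurrent.Callable;

/** Hypothetical helper (not part of JSoup): paces attempts and retries failures. */
final class ThrottledRetry {
    private final long minGapMillis; // shortest allowed pause between two attempts
    private final int maxAttempts;   // total tries per task, including the first
    private long lastAttemptNanos;   // start time of the most recent attempt
    private boolean hasRun;          // false until the first attempt is made

    ThrottledRetry(long minGapMillis, int maxAttempts) {
        if (minGapMillis < 0 || maxAttempts < 1) {
            throw new IllegalArgumentException("minGapMillis >= 0 and maxAttempts >= 1 required");
        }
        this.minGapMillis = minGapMillis;
        this.maxAttempts = maxAttempts;
    }

    /** Runs the task; on failure waits longer (the gap doubles per retry) and tries again. */
    synchronized <T> T call(Callable<T> task) throws Exception {
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            // Gap is minGapMillis for the first attempt, then 2x, 4x, ... (capped at 1024x).
            pause(minGapMillis * (1L << Math.min(attempt - 1, 10)));
            try {
                return task.call();
            } catch (InterruptedException e) {
                throw e; // do not retry when the thread was asked to stop
            } catch (Exception e) {
                last = e; // remember the failure and move on to the next attempt
            }
        }
        throw last; // every attempt failed
    }

    /** Sleeps until at least gapMillis have passed since the previous attempt started. */
    private void pause(long gapMillis) throws InterruptedException {
        if (hasRun) {
            long elapsedMillis = (System.nanoTime() - lastAttemptNanos) / 1_000_000L;
            if (elapsedMillis < gapMillis) {
                Thread.sleep(gapMillis - elapsedMillis);
            }
        }
        hasRun = true;
        lastAttemptNanos = System.nanoTime();
    }

    public static void main(String[] args) throws Exception {
        ThrottledRetry retry = new ThrottledRetry(200, 3);
        int[] calls = {0};
        // Simulated request that fails twice before succeeding (no network involved).
        String page = retry.call(() -> {
            if (++calls[0] < 3) throw new java.io.IOException("simulated timeout");
            return "<html></html>";
        });
        System.out.println("got " + page.length() + " chars after " + calls[0] + " attempts");
    }
}
```

In the scraper, each request would go through one shared instance, for example retry.call(() -> Jsoup.connect(url).get()), so that every connection in every loop is paced by the same gap.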
Here's the method to parse a String as HTML via JSoup (the base URL parameter is there to let JSoup produce absolute URLs from relative ones, BTW): jsoup.org/apidocs/org/jsoup/…
– mvmn
Mar 9 at 0:06
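The "fetch with a pooled client, parse with JSoup" idea could look roughly like this. Note the swap: instead of Commons HTTPClient, this sketch uses the JDK's built-in java.net.http.HttpClient (Java 11+), which also keeps connections alive for reuse when a single client instance is shared. The class name SharedClientFetcher, the method names, and the timeout values are assumptions made for illustration.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

/** Hypothetical fetcher: one shared HTTP client, JSoup left for parsing only. */
final class SharedClientFetcher {
    // A single client for the whole program, so connections to the same host can be reused.
    private static final HttpClient CLIENT = HttpClient.newBuilder()
            .connectTimeout(Duration.ofSeconds(10)) // assumed value
            .followRedirects(HttpClient.Redirect.NORMAL)
            .build();

    /** Builds a GET request carrying the given User-Agent header. */
    static HttpRequest buildRequest(String url, String userAgent) {
        return HttpRequest.newBuilder(URI.create(url))
                .timeout(Duration.ofSeconds(30)) // assumed per-request limit
                .header("User-Agent", userAgent)
                .GET()
                .build();
    }

    /** Downloads the page body as a String; fails on any non-200 status. */
    static String fetchHtml(String url, String userAgent)
            throws IOException, InterruptedException {
        HttpResponse<String> response =
                CLIENT.send(buildRequest(url, userAgent), HttpResponse.BodyHandlers.ofString());
        if (response.statusCode() != 200) {
            throw new IOException("HTTP " + response.statusCode() + " for " + url);
        }
        // With jsoup on the classpath, the caller can parse this without any further
        // network access:  Document doc = Jsoup.parse(html, url);
        return response.body();
    }
}
```

Passing the page URL as the second argument of Jsoup.parse(html, url) supplies the base URL that the comment above refers to, so relative links in the page can still be resolved to absolute ones.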
P.S. If my answer properly addresses your problem, would you mind upvoting it and/or marking it as the correct answer? Thanks!
– mvmn
Mar 9 at 11:41
answered Mar 8 at 23:50 by mvmn