ElementTree does not seem to get some texts/elements in the tree
I have a Wikipedia dump that I want to parse, and I am running into some mysterious problems with Python's XML parser, ElementTree.
My latest problem is that ElementTree does not seem to find text that is actually there. Here is some example data:
<page>
  <title>Cengiz Han</title>
  <ns>0</ns>
  <id>10</id>
  <revision>
    <id>20337884</id>
    <parentid>20218916</parentid>
    <timestamp>2019-01-29T14:02:43Z</timestamp>
    <contributor>
      <username>CommonsDelinker</username>
      <id>31545</id>
    </contributor>
    <comment>China_11b.jpg dosyası Map_of_China_1142.jpg ile değiştirildi</comment>
    <model>wikitext</model>
    <format>text/x-wiki</format>
    <text xml:space="preserve">
...some long Genghis Khan stuff...
    </text>
</page>
Now when I parse it with this:
for event, elem in et.iterparse('dataset/wiki_test', events=('start', 'end', 'start-ns', 'end-ns')):
    if event == 'start':
        if elem.tag == 'page':
            if len(list(elem)) == 0:
                continue
            title = elem.find('title').text
            if title == None or 'MediaWiki' in title:
                elem.clear()
                continue
            wiki_id = elem.find('id')
            if wiki_id == None:
                elem.clear()
                continue
            wiki_id = wiki_id.text
            revision = elem.find('revision')
            if revision != None:
                print(list(revision))
                text = revision.find('text').text
                print(text)
                if text != None:
                    count += 1
                    titles += title + '\n'
                    page = {'wiki_id': wiki_id, 'title': title, 'text': text.text}
                    pages += json.dumps(page, ensure_ascii=False) + '\n'
            elem.clear()
The line revision.find('text').text seems to find no text for some elements, including the one above. This affects roughly one seventh of my data, which is annoying. The same thing happened with page->id for some other entries, where find claimed that the element did not exist at all. I worked around that by skipping those entries, but I would rather not do that, and the error does not make sense to me at all.
Here is another page, which works totally fine.
<page>
  <title>Mustafa Suphi</title>
  <ns>0</ns>
  <id>22</id>
  <revision>
    <id>20077185</id>
    <parentid>20017115</parentid>
    <timestamp>2018-10-14T08:31:32Z</timestamp>
    <contributor>
      <username>Vikiçizer</username>
      <id>90501</id>
    </contributor>
    <comment>/* top */düzeltme [[Vikipedi:AWB|AWB]] ile</comment>
    <model>wikitext</model>
    <format>text/x-wiki</format>
    <text xml:space="preserve">
...some Mustafa Suphi stuff...
    </text>
    <sha1>m5finh6h2kr8h2fbtmsatp5fhz1siwq</sha1>
  </revision>
</page>
What am I doing wrong?
python xml python-3.7
asked Mar 7 at 14:13 by Burak Özmen, edited Mar 7 at 15:33
event == 'start' does not guarantee that the whole element is present. Change to event == 'end'.
– stovfl
Mar 7 at 17:21
Can you please upload a complete XML and explain what is the data structure you want to extract from this XML?
– balderman
Mar 9 at 16:13
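The first comment above points at the likely cause: on a 'start' event iterparse has only seen the opening tag, so an element's text and children may or may not be there yet. A minimal sketch of the same loop driven by 'end' events follows. It assumes un-namespaced tags as in the samples in the question; the mediawiki root element, the in-memory SAMPLE data, and the extract_pages helper are hypothetical stand-ins for the real dump file and loop.

```python
import io
import json
import xml.etree.ElementTree as et

# Hypothetical two-page sample standing in for the real dump file.
SAMPLE = b"""<mediawiki>
  <page>
    <title>Cengiz Han</title>
    <ns>0</ns>
    <id>10</id>
    <revision>
      <id>20337884</id>
      <text xml:space="preserve">...some long Genghis Khan stuff...</text>
    </revision>
  </page>
  <page>
    <title>Mustafa Suphi</title>
    <ns>0</ns>
    <id>22</id>
    <revision>
      <id>20077185</id>
      <text xml:space="preserve">...some Mustafa Suphi stuff...</text>
    </revision>
  </page>
</mediawiki>"""


def extract_pages(source):
    """Yield one dict per <page>, reading it only after its end tag was parsed."""
    for event, elem in et.iterparse(source, events=('end',)):
        if elem.tag != 'page':
            continue
        # At 'end' the whole subtree of this page is available.
        title = elem.findtext('title')
        wiki_id = elem.findtext('id')
        text = elem.findtext('revision/text')
        if title is not None and wiki_id is not None and text is not None:
            if 'MediaWiki' not in title:
                yield {'wiki_id': wiki_id, 'title': title, 'text': text}
        elem.clear()  # release the finished page to keep memory use low


pages = list(extract_pages(io.BytesIO(SAMPLE)))
for page in pages:
    print(json.dumps(page, ensure_ascii=False))
```

With the real dump, the path 'dataset/wiki_test' would be passed in place of the BytesIO object. If the dump declares a default XML namespace, the tag names would need the namespace prefix in braces, which this sketch does not handle.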
1 Answer
You have posted two examples: "working" and "not working".
In the "not working" one there is no closing </revision> tag.
Are you sure this is the XML you have, or is it just a copy & paste mistake?
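One quick way to rule this out is to parse a single page fragment on its own and see whether the parser accepts it. The sketch below is only an illustration: the shortened BROKEN and FIXED strings and the is_well_formed helper are hypothetical, not taken from the actual dump.

```python
import xml.etree.ElementTree as et

# Shortened, hypothetical versions of the two pages from the question.
BROKEN = ("<page><title>Cengiz Han</title><revision><id>20337884</id>"
          "<text>...</text></page>")
FIXED = ("<page><title>Cengiz Han</title><revision><id>20337884</id>"
         "<text>...</text></revision></page>")


def is_well_formed(fragment):
    """Return True if the fragment parses as XML, False otherwise."""
    try:
        et.fromstring(fragment)
    except et.ParseError:
        # Raised for problems such as a mismatched or missing closing tag.
        return False
    return True


print(is_well_formed(BROKEN))
print(is_well_formed(FIXED))
```

If the real file were missing the closing tag, iterparse would be expected to raise a ParseError when it reached that page rather than silently return empty text.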
answered Mar 9 at 7:08 by balderman
Copy paste mistake.
– Burak Özmen
Mar 9 at 16:08