ElementTree does not seem to get some texts/elements in the tree



I have a Wikipedia dump that I want to parse, and I am running into some mysterious problems with Python's XML parser, ElementTree.

My most recent problem is that ElementTree does not seem to find text that is actually there. Here is some example data:



<page>
<title>Cengiz Han</title>
<ns>0</ns>
<id>10</id>
<revision>
<id>20337884</id>
<parentid>20218916</parentid>
<timestamp>2019-01-29T14:02:43Z</timestamp>
<contributor>
<username>CommonsDelinker</username>
<id>31545</id>
</contributor>
<comment>China_11b.jpg dosyası Map_of_China_1142.jpg ile değiştirildi</comment>
<model>wikitext</model>
<format>text/x-wiki</format>
<text xml:space="preserve">
...some long Genghis Khan stuff...
</text>
</page>


Now when I parse it with this:



import json
import xml.etree.ElementTree as et

# Accumulators (initial values assumed; they were not shown in the original post)
count = 0
titles = ''
pages = ''

for event, elem in et.iterparse('dataset/wiki_test', events=('start', 'end', 'start-ns', 'end-ns')):
    if event == 'start':
        if elem.tag == 'page':
            # Skip pages whose children have not shown up
            if len(list(elem)) == 0:
                continue
            title = elem.find('title').text
            if title is None or 'MediaWiki' in title:
                elem.clear()
                continue
            wiki_id = elem.find('id')
            if wiki_id is None:
                elem.clear()
                continue
            wiki_id = wiki_id.text
            revision = elem.find('revision')
            if revision is not None:
                print(list(revision))
                text = revision.find('text').text
                print(text)
                if text is not None:
                    count += 1
                    titles += title + '\n'
                    # One JSON object per line
                    page = {'wiki_id': wiki_id, 'title': title, 'text': text}
                    pages += json.dumps(page, ensure_ascii=False) + '\n'
            elem.clear()


The revision.find('text').text line seems to find no text for some elements, including the one above. This affects roughly one seventh of my data, which is annoying. The same thing happened with page->id for some other entries, where find() claimed that the element did not exist at all. I worked around that by skipping those entries, but I don't really want to do that, and the error makes no sense to me.



Here is another page, which works totally fine.



<page>
<title>Mustafa Suphi</title>
<ns>0</ns>
<id>22</id>
<revision>
<id>20077185</id>
<parentid>20017115</parentid>
<timestamp>2018-10-14T08:31:32Z</timestamp>
<contributor>
<username>Vikiçizer</username>
<id>90501</id>
</contributor>
<comment>/* top */düzeltme [[Vikipedi:AWB|AWB]] ile</comment>
<model>wikitext</model>
<format>text/x-wiki</format>
<text xml:space="preserve">
...some Mustafa Suphi stuff...
</text>
<sha1>m5finh6h2kr8h2fbtmsatp5fhz1siwq</sha1>
</revision>
</page>


What am I doing wrong?










  • event == 'start' does not guarantee that the whole element is present. Change to event == 'end'.

    – stovfl
    Mar 7 at 17:21











  • Can you please upload a complete XML file and explain what data structure you want to extract from it?

    – balderman
    Mar 9 at 16:13
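The advice in the first comment comes down to reading a page only once its closing tag has been parsed, which iterparse reports as the 'end' event. The ElementTree documentation says that at a 'start' event only the tag and attributes are guaranteed, while text and children may or may not be present yet, which would explain why only some pages fail. Below is a minimal sketch of that approach, not a drop-in fix: the SAMPLE data and the iter_pages helper are made up for illustration, and it assumes the tags carry no XML namespace, as in the question's code.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical miniature of the dump; a real file would be passed by path.
SAMPLE = """<mediawiki>
  <page>
    <title>Mustafa Suphi</title>
    <id>22</id>
    <revision>
      <id>20077185</id>
      <text xml:space="preserve">...some Mustafa Suphi stuff...</text>
    </revision>
  </page>
</mediawiki>"""


def iter_pages(source):
    """Yield one dict per <page>, reading its fields only on 'end' events."""
    for event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "page":
            continue
        # At 'end', every child of this <page> has been parsed.
        title = elem.findtext("title")
        wiki_id = elem.findtext("id")  # direct child only, so the page id
        text = elem.findtext("revision/text")
        if title is not None and wiki_id is not None and text is not None:
            yield {"wiki_id": wiki_id, "title": title, "text": text}
        elem.clear()  # free memory once the page has been handled


pages = list(iter_pages(io.BytesIO(SAMPLE.encode("utf-8"))))
```

Clearing the element only after the page has been handled keeps memory use low on a large dump without discarding data that is still needed.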















python xml python-3.7






asked Mar 7 at 14:13, edited Mar 7 at 15:33
Burak Özmen







1 Answer
You have posted two examples: "working" and "not working".



In the "not working" one there is no closing

</revision>

tag. Are you sure this is the XML you have, or is it just a copy-and-paste mistake?
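One way to tell the two cases apart: if the closing tag were really missing from the file, ElementTree would not quietly return empty text. It would raise a ParseError for the mismatched tag. A small sketch of that check, using made-up fragments modelled on the two pages in the question (the is_well_formed helper is hypothetical):

```python
import xml.etree.ElementTree as ET

# Hypothetical fragments modelled on the two pages in the question.
BROKEN = "<page><revision><text>stuff</text></page>"  # no </revision>
INTACT = "<page><revision><text>stuff</text></revision></page>"


def is_well_formed(xml_string):
    """Return True if the string parses as XML, False on a ParseError."""
    try:
        ET.fromstring(xml_string)
    except ET.ParseError:
        return False
    return True


broken_ok = is_well_formed(BROKEN)
intact_ok = is_well_formed(INTACT)
```

So if the parse runs without an exception, a missing closing tag is unlikely to be the cause of the missing text.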






  • Copy paste mistake.

    – Burak Özmen
    Mar 9 at 16:08










answered Mar 9 at 7:08
balderman











