[Python] HTMLParser Handle_StartTag

Question

Hello,
I'd like to understand how this method works.

I defined my own parser and overrode its handle_starttag method:

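# Python 3 import; in Python 2 this was: from HTMLParser import HTMLParser
from html.parser import HTMLParser

class MyHTMLParser(HTMLParser):

    def handle_starttag(self, tag, attrs):
        print("Encountered the beginning of a %s tag" % tag)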

Then I run:

p = MyHTMLParser()
test = '<A HREF="https://www.cwi.nl/"> <td class="src"> Oyo <td> <tr> <td> ceci est un </td> test'
p.feed(test)
p.handle_starttag('a', [('href', 'https://www.cwi.nl/')])

and this is the output I get:

Encountered the beginning of a a tag
Encountered the beginning of a td tag
Encountered the beginning of a td tag
Encountered the beginning of a tr tag
Encountered the beginning of a td tag
Encountered the end of a td tag
Encountered the beginning of a a tag

Shouldn't I get only "Encountered the beginning of a a tag"?

Thanks

See you,
dje-dje

PS:
I looked here:
http://www.python.org/doc/current/lib/module-HTMLParser.html
The documentation says:

handle_starttag( tag, attrs)
This method is called to handle the start of a tag. It is intended to be overridden by a derived class; the base class implementation does nothing.

The tag argument is the name of the tag converted to lower case. The attrs argument is a list of (name, value) pairs containing the attributes found inside the tag's <> brackets. The name will be translated to lower case and double quotes and backslashes in the value have been interpreted. For instance, for the tag <A HREF="https://www.cwi.nl/">, this method would be called as "handle_starttag('a', [('href', 'https://www.cwi.nl/')])".

There are 10 types of people in the world:
those who understand binary and those who don't


sebsauvage · Answer

"Shouldn't I get only "Encountered the beginning of a a tag"?"

No. The result is normal. As soon as you call .feed(), the parser calls handle_starttag() for every opening tag it encounters in the test string (A, TD, etc.).
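
To make this behaviour visible, here is a minimal sketch, assuming Python 3 (where the module is html.parser rather than HTMLParser). The class name TagLogger and its seen list are invented for this illustration; only HTMLParser, feed() and handle_starttag() come from the standard library.

```python
from html.parser import HTMLParser


class TagLogger(HTMLParser):
    """Hypothetical helper: records every start tag the parser reports."""

    def __init__(self):
        super().__init__()
        self.seen = []

    def handle_starttag(self, tag, attrs):
        # feed() calls this once per opening tag; tag is already lower-cased.
        self.seen.append(tag)


logger = TagLogger()
logger.feed('<A HREF="https://www.cwi.nl/"> <td class="src"> Oyo <td> <tr> '
            '<td> ceci est un </td> test')
print(logger.seen)  # ['a', 'td', 'td', 'tr', 'td']
```

The five entries match the five "beginning of a ... tag" lines that feed() produced in the question.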

dje-dje · Answer

OK.
But then, how can I get only that tag in the output? (What is the point of passing the tag name and the attribute values to handle_starttag?)

Can this be done with the built-in functions, or do I have to customize it?
(Ideally, I'd like it to distinguish between tags whose attribute values differ.)
Thanks

See you,
dje-dje


sebsauvage · Answer

You just need to put an IF inside handle_starttag().
You don't need to call handle_starttag() yourself: the HTML parser calls it for you whenever you call feed().
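
As a rough sketch of what such an IF could look like (Python 3 assumed; LinkFilter, wanted_href and matches are hypothetical names chosen for this example, not part of the library), the handler below ignores every tag except <a> and then compares the href attribute against a wanted value:

```python
from html.parser import HTMLParser


class LinkFilter(HTMLParser):
    """Hypothetical example: react only to <a> tags with a given href."""

    def __init__(self, wanted_href):
        super().__init__()
        self.wanted_href = wanted_href
        self.matches = []

    def handle_starttag(self, tag, attrs):
        # Skip every tag that is not an anchor.
        if tag != 'a':
            return
        # attrs is a list of (name, value) pairs; a dict simplifies lookups.
        attributes = dict(attrs)
        if attributes.get('href') == self.wanted_href:
            self.matches.append(attributes)
            print("Encountered the beginning of a %s tag" % tag)


parser = LinkFilter('https://www.cwi.nl/')
parser.feed('<A HREF="https://www.cwi.nl/"> <td class="src"> Oyo <td> <tr> '
            '<a href="https://example.org/">other</a>')
```

Only feed() is called here. In the question, the extra final "Encountered the beginning of a a tag" line came from the explicit p.handle_starttag(...) call made after feed().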

I posted an example here:
http://www.sebsauvage.net/python/snyppets/index.html#getlinks2
