Extracting data from an HTML file

Closed
rius - Posts: 9 - Registered: Wednesday 21 November 2007 - Status: Member - Last activity: 24 May 2018 - Edited 11 Sep 2017 at 10:18
Flachy Joe - Posts: 2103 - Registered: Thursday 16 September 2004 - Status: Member - Last activity: 21 November 2023 - 5 Sep 2017 at 21:08
Hello,

I have a file containing HTML code and I would like to extract data from it.
The code is not formatted at all, but I have no other way of getting it :(

It was copied out with the browser's "Inspect" tool:

<tbody><tr style="height: auto;"><th style="height: 0px; width: 80px;"></th><th style="height: 0px; width: 50px;"></th><th style="height: 0px; width: 60px;"></th><th style="height: 0px; width: 60px;"></th><th style="height: 0px; width: 100px;"></th><th style="height: 0px; width: 80px;"></th><th style="height: 0px; width: 80px;"></th><th style="height: 0px; width: 80px;"></th><th style="height: 0px; width: 90px;"></th></tr><tr class=" ev_dhx_web"><td align="left" valign="middle" class="                      " title="VINT2017_1">VINT2017_1</td><td align="center" valign="middle" class="                      " title="DC6">DC6</td><td align="left" valign="middle" class="                      " title="SBEG">SBEG</td><td align="left" valign="middle" class="                      ">SBTF</td><td align="left" valign="middle" class="                      " title="FLC01">FLC01</td><td align="right" valign="middle" class="                      " title="01h 18m">01h 18m</td><td align="right" valign="middle" class="                      " title="281">281</td><td align="right" valign="middle" class="                      " title="95.00">95.00</td><td align="right" valign="middle" class="                      " title="Aug 19 2017">Aug 19 2017</td></tr><tr class=" odd_dhx_web"><td align="left" valign="middle" class="                 " title="VINT2017_7">VINT2017_7</td><td align="center" valign="middle" class="                 " title="DC6">DC6</td><td align="left" valign="middle" class="                 " title="SPJC">SPJC</td><td align="left" valign="middle" class="                 " title="SPZO">SPZO</td><td align="left" valign="middle" class="                 " title="FLC01">FLC01</td><td align="right" valign="middle" class="                 " title="01h 18m">01h 18m</td><td align="right" valign="middle" class="                 " title="316">316</td><td align="right" valign="middle" class="                 " title="101.50">101.50</td><td align="right" valign="middle" class="       
          " title="Aug 23 2017">Aug 23 2017</td></tr><tr class=" ev_dhx_web"><td align="left" valign="middle" class="      " title="VINT2017_6">VINT2017_6</td><td align="center" valign="middle" class="      ">DC6</td><td align="left" valign="middle" class="      " title="SPHI">SPHI</td><td align="left" valign="middle" class="      ">SPJC</td><td align="left" valign="middle" class="      ">FLC01</td><td align="right" valign="middle" class="      " title="01h 24m">01h 24m</td><td align="right" valign="middle" class="      " title="353">353</td><td align="right" valign="middle" class="      " title="96.25">96.25</td><td align="right" valign="middle" class="      " title="Aug 20 2017">Aug 20 2017</td></tr><tr class=" odd_dhx_web"><td align="left" valign="middle" class="    " title="VINT2017_5">VINT2017_5</td><td align="center" valign="middle" class="    " title="DC6">DC6</td><td align="left" valign="middle" class="    ">SECU</td><td align="left" valign="middle" class="    ">SPHI</td><td align="left" valign="middle" class="    ">FLC01</td><td align="right" valign="middle" class="    " title="01h 06m">01h 06m</td><td align="right" valign="middle" class="    ">239</td><td align="right" valign="middle" class="    ">103.50</td><td align="right" valign="middle" class="    ">Aug 20 2017</td></tr><tr class=" ev_dhx_web"><td align="left" valign="middle" class="       ">VINT2017_9</td><td align="center" valign="middle" class="       ">DC6</td><td align="left" valign="middle" class="       ">SLLP</td><td align="left" valign="middle" class="       " title="SLET">SLET</td><td align="left" valign="middle" class="       " title="FLC01">FLC01</td><td align="right" valign="middle" class="       " title="01h 18m">01h 18m</td><td align="right" valign="middle" class="       ">297</td><td align="right" valign="middle" class="       ">-1.75</td><td align="right" valign="middle" class="       ">Aug 24 2017</td></tr><tr class=" odd_dhx_web"><td align="left" valign="middle" class=" 
">VINT2017_10</td><td align="center" valign="middle" class=" ">DC6</td><td align="left" valign="middle" class=" " title="SLET">SLET</td><td align="left" valign="middle" class=" ">SBCY</td><td align="left" valign="middle" class=" ">FLC01</td><td align="right" valign="middle" class=" ">01h 48m</td><td align="right" valign="middle" class=" ">426</td><td align="right" valign="middle" class=" ">105.50</td><td align="right" valign="middle" class=" ">Aug 25 2017</td></tr><tr class=" ev_dhx_web"><td align="left" valign="middle">VINT2017_4</td><td align="center" valign="middle">DC6</td><td align="left" valign="middle">SPQT</td><td align="left" valign="middle">SECU</td><td align="left" valign="middle">FLC01</td><td align="right" valign="middle">01h 30m</td><td align="right" valign="middle">345</td><td align="right" valign="middle">101.50</td><td align="right" valign="middle">Aug 20 2017</td></tr><tr class=" odd_dhx_web"><td align="left" valign="middle" class=" " title="VINT2017_3">VINT2017_3</td><td align="center" valign="middle" class=" ">DC6</td><td align="left" valign="middle" class=" ">SBTT</td><td align="left" valign="middle" class=" ">SPQT</td><td align="left" valign="middle" class=" ">FLC01</td><td align="right" valign="middle" class=" ">01h 06m</td><td align="right" valign="middle" class=" ">203</td><td align="right" valign="middle" class=" ">100.00</td><td align="right" valign="middle" class=" ">Aug 20 2017</td></tr><tr class=" ev_dhx_web"><td align="left" valign="middle">TAP1001</td><td align="center" valign="middle">738</td><td align="left" valign="middle">LPPT</td><td align="left" valign="middle">LPMA</td><td align="left" valign="middle">FLC01</td><td align="right" valign="middle">01h 36m</td><td align="right" valign="middle">521</td><td align="right" valign="middle">101.50</td><td align="right" valign="middle">Aug 30 2017</td></tr><tr class=" odd_dhx_web"><td align="left" valign="middle">VINT2017_2</td><td align="center" valign="middle">DC6</td><td align="left" 
valign="middle">SBTF</td><td align="left" valign="middle">SBTT</td><td align="left" valign="middle">FLC01</td><td align="right" valign="middle">01h 24m</td><td align="right" valign="middle">316</td><td align="right" valign="middle">87.75</td><td align="right" valign="middle">Aug 20 2017</td></tr><tr class=" ev_dhx_web"><td align="left" valign="middle" class=" ">VINT2017_8</td><td align="center" valign="middle" class=" ">DC6</td><td align="left" valign="middle" class=" ">SPZO</td><td align="left" valign="middle" class=" ">SLLP</td><td align="left" valign="middle" class=" ">FLC01</td><td align="right" valign="middle" class=" ">01h 12m</td><td align="right" valign="middle" class=" ">282</td><td align="right" valign="middle" class=" ">101.50</td><td align="right" valign="middle" class=" " title="Aug 23 2017">Aug 23 2017</td></tr><tr class=" odd_dhx_web"><td align="left" valign="middle" class=" ">VINT2017_10</td><td align="center" valign="middle" class=" ">DC6</td><td align="left" valign="middle" class=" ">SLET</td><td align="left" valign="middle" class=" ">SBCY</td><td align="left" valign="middle" class=" ">FLC02</td><td align="right" valign="middle" class=" ">01h 48m</td><td align="right" valign="middle" class=" ">426</td><td align="right" valign="middle" class=" ">98.25</td><td align="right" valign="middle" class=" " title="Aug 25 2017">Aug 25 2017</td></tr><tr class=" ev_dhx_web"><td align="left" valign="middle" class=" ">VINT2017_9</td><td align="center" valign="middle" class=" ">DC6</td><td align="left" valign="middle" class=" ">SLLP</td><td align="left" valign="middle" class=" ">SLET</td><td align="left" valign="middle" class=" ">FLC02</td><td align="right" valign="middle" class=" ">01h 18m</td><td align="right" valign="middle" class=" ">297</td><td align="right" valign="middle" class=" ">63.50</td><td align="right" valign="middle" class=" " title="Aug 24 2017">Aug 24 2017</td></tr><tr class=" odd_dhx_web"><td align="left" valign="middle" class=" 
">VINT2017_8</td><td align="center" valign="middle" class=" ">DC6</td><td align="left" valign="middle" class=" ">SPZO</td><td align="left" valign="middle" class=" ">SLLP</td><td align="left" valign="middle" class=" ">FLC02</td><td align="right" valign="middle" class=" ">01h 18m</td><td align="right" valign="middle" class=" ">282</td><td align="right" valign="middle" class=" ">100.25</td><td align="right" valign="middle" class=" " title="Aug 23 2017">Aug 23 2017</td></tr><tr class=" ev_dhx_web"><td align="left" valign="middle" class=" ">TAP1001</td><td align="center" valign="middle" class=" ">738</td><td align="left" valign="middle" class=" ">LPPT</td><td align="left" valign="middle" class=" ">LPMA</td><td align="left" valign="middle" class=" ">FLC02</td><td align="right" valign="middle" class=" ">01h 48m</td><td align="right" valign="middle" class=" ">521</td><td align="right" valign="middle" class=" ">103.50</td><td align="right" valign="middle" class=" " title="Aug 30 2017">Aug 30 2017</td></tr><tr class=" odd_dhx_web"><td align="left" valign="middle" class="   ">VINT2017_7</td><td align="center" valign="middle" class="   ">DC6</td><td align="left" valign="middle" class="   ">SPJC</td><td align="left" valign="middle" class="   ">SPZO</td><td align="left" valign="middle" class="   ">FLC02</td><td align="right" valign="middle" class="   " title="02h 00m">02h 00m</td><td align="right" valign="middle" class="   ">316</td><td align="right" valign="middle" class="   ">100.00</td><td align="right" valign="middle" class="   " title="Aug 23 2017">Aug 23 2017</td></tr><tr class=" ev_dhx_web"><td align="left" valign="middle">VINT2017_6</td><td align="center" valign="middle">DC6</td><td align="left" valign="middle">SPHI</td><td align="left" valign="middle">SPJC</td><td align="left" valign="middle">FLC02</td><td align="right" valign="middle">01h 24m</td><td align="right" valign="middle">353</td><td align="right" valign="middle">100.00</td><td align="right" valign="middle">Aug 
20 2017</td></tr><tr class=" odd_dhx_web"><td align="left" valign="middle">VINT2017_5</td><td align="center" valign="middle">DC6</td><td align="left" valign="middle">SECU</td><td align="left" valign="middle">SPHI</td><td align="left" valign="middle">FLC02</td><td align="right" valign="middle">01h 00m</td><td align="right" valign="middle">239</td><td align="right" valign="middle">88.75</td><td align="right" valign="middle">Aug 20 2017</td></tr><tr class=" ev_dhx_web"><td align="left" valign="middle">VINT2017_4</td><td align="center" valign="middle">DC6</td><td align="left" valign="middle">SPQT</td><td align="left" valign="middle">SECU</td><td align="left" valign="middle">FLC02</td><td align="right" valign="middle">01h 30m</td><td align="right" valign="middle">345</td><td align="right" valign="middle">100.00</td><td align="right" valign="middle">Aug 20 2017</td></tr><tr class=" odd_dhx_web"><td align="left" valign="middle">VINT2017_3</td><td align="center" valign="middle">DC6</td><td align="left" valign="middle">SBTT</td><td align="left" valign="middle">SPQT</td><td align="left" valign="middle">FLC02</td><td align="right" valign="middle">01h 06m</td><td align="right" valign="middle">203</td><td align="right" valign="middle">105.50</td><td align="right" valign="middle">Aug 20 2017</td></tr><tr class=" ev_dhx_web"><td align="left" valign="middle">VINT2017_2</td><td align="center" valign="middle">DC6</td><td align="left" valign="middle">SBTF</td><td align="left" valign="middle">SBTT</td><td align="left" valign="middle">FLC02</td><td align="right" valign="middle">01h 24m</td><td align="right" valign="middle">316</td><td align="right" valign="middle">91.50</td><td align="right" valign="middle">Aug 20 2017</td></tr><tr class=" odd_dhx_web"><td align="left" valign="middle">VINT2017_1</td><td align="center" valign="middle">DC6</td><td align="left" valign="middle">SBEG</td><td align="left" valign="middle">SBTF</td><td align="left" valign="middle">FLC02</td><td align="right" 
valign="middle">01h 18m</td><td align="right" valign="middle">281</td><td align="right" valign="middle">93.50</td><td align="right" valign="middle">Aug 19 2017</td></tr><tr class=" ev_dhx_web"><td align="left" valign="middle">TAP1002</td><td align="center" valign="middle">738</td><td align="left" valign="middle">LEMG</td><td align="left" valign="middle">LEMH</td><td align="left" valign="middle">FLC02</td><td align="right" valign="middle">01h 18m</td><td align="right" valign="middle">454</td><td align="right" valign="middle">102.25</td><td align="right" valign="middle">Aug 31 2017</td></tr></tbody>


In the code there are several strings that start with FLC (for example FLC01, FLC02), and a little further on there are durations (sometimes in quotes), such as 01h 18m.

What I would like is the sum of the durations for each FLC code.

Thanks for your help

1 answer

Flachy Joe - Posts: 2103 - Registered: Thursday 16 September 2004 - Status: Member - Last activity: 21 November 2023 - 259
Edited 3 Sep 2017 at 17:07
Hi,
To start with, we can extract the interesting data:
flo@bidul:~/Test$ cat data.htm
<tbody><tr style="height: auto;"><th style="height: 0px; width: 80px;"></th><th style="height: 0px; width: 50px;"></th><th style="height: 0px; width: 60px;"></th><th style="height: 0px; width: 60px;"></th><th style="height: 0px; width: 100px;"></th><th style="height: 0px; width: 80px;"></th><th style="height: 0px; width: 80px;"></th><th style="height: 0px; width: 80px;"></th><th style="height: 0px; width: 90px;"></th></tr><tr class=" ev_dhx_web"><td align="left" valign="middle" class=" " title="VINT2017_1">VINT2017_1</td><td align="center" valign="middle" class=" " title="DC6">
[...]
<td align="left" valign="middle">TAP1002</td><td align="center" valign="middle">738</td><td align="left" valign="middle">LEMG</td><td align="left" valign="middle">LEMH</td><td align="left" valign="middle">FLC02</td><td align="right" valign="middle">01h 18m</td><td align="right" valign="middle">454</td><td align="right" valign="middle">102.25</td><td align="right" valign="middle">Aug 31 2017</td></tr></tbody>

flo@bidul:~/Test$ sed "s/tr><tr/>\n</g;s/<[^>]*>/,/g;s/,,/,/g" < data.htm | cut -d, -f6-7
,
FLC01,01h 18m
FLC01,01h 18m
FLC01,01h 24m
FLC01,01h 06m
FLC01,01h 18m
FLC01,01h 48m
FLC01,01h 30m
FLC01,01h 06m
FLC01,01h 36m
FLC01,01h 24m
FLC01,01h 12m
FLC02,01h 48m
FLC02,01h 18m
FLC02,01h 18m
FLC02,01h 48m
FLC02,02h 00m
FLC02,01h 24m
FLC02,01h 00m
FLC02,01h 30m
FLC02,01h 06m
FLC02,01h 24m
FLC02,01h 18m
FLC02,01h 18m


I'm thinking about the rest...
;-) Flachy Joe ;-)
"Whoever never falls flat has no chance of growing!" Anonymous graffiti
Flachy Joe - Posts: 2103 - Registered: Thursday 16 September 2004 - Status: Member - Last activity: 21 November 2023 - 259
3 Sep 2017 at 17:36
With a bit of awk, this gives:
sed "s/tr><tr/>\n</g;s/<[^>]*>/,/g;s/,,/,/g;s/h /,/g" < data.htm | cut -d, -f6-8 | awk -F "," '{ t[$1]+=$2; t1[$1]+=$3} END {for(n in t)printf "%s %ih%imn\n", n, t[n]+int(t1[n]/60), t1[n]%60 }'
0h0mn
FLC01 15h0mn
FLC02 17h12mn


There is probably a way to do it all in awk, but I'm no expert...
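For what it's worth, here is a rough sketch of how the whole job might look in awk alone. It is not from the thread: the function name sum_flc is made up, and it assumes that the cell right after each FLC code holds the duration in the form "HHh MMm", as in the sample above. It avoids sed's \n replacement, so it should not depend on GNU sed.

```shell
#!/bin/sh
# Hypothetical helper (not from the thread): sum the durations per FLC code with awk only.
# Assumption: the cell right after the FLC code holds the duration as "HHh MMm".
sum_flc() {
    awk '
    # Gather the whole input first, so line breaks inside the HTML do not matter.
    { html = html $0 " " }
    END {
        nrows = split(html, rows, /<\/tr>/)          # one array entry per table row
        for (i = 1; i <= nrows; i++) {
            row = rows[i]
            gsub(/<[^>]*>/, ",", row)                # replace every tag with a comma
            ncells = split(row, cells, /,+/)         # a run of commas is one separator
            for (j = 1; j < ncells; j++) {
                if (cells[j] ~ /^FLC[0-9]+$/ && cells[j + 1] ~ /^[0-9]+h [0-9]+m$/) {
                    d = cells[j + 1]
                    hours = d + 0                    # numeric prefix: "01h 18m" gives 1
                    sub(/^[0-9]+h /, "", d)
                    minutes = d + 0                  # "18m" gives 18
                    total[cells[j]] += hours * 60 + minutes
                }
            }
        }
        for (code in total)
            printf "%s %dh%02dm\n", code, int(total[code] / 60), total[code] % 60
    }' "$@" | sort                                   # awk array order is unspecified
}
```

Called as `sum_flc data.htm`, it should report the same totals as the sed/awk pipeline above, without the empty first line, since rows that contain no FLC code are skipped.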
rius - Posts: 9 - Registered: Wednesday 21 November 2007 - Status: Member - Last activity: 24 May 2018 > Flachy Joe - Posts: 2103 - Registered: Thursday 16 September 2004 - Status: Member - Last activity: 21 November 2023
5 Sep 2017 at 11:31
Thanks a lot, I'll go and set this up right away
Flachy Joe - Posts: 2103 - Registered: Thursday 16 September 2004 - Status: Member - Last activity: 21 November 2023 - 259
5 Sep 2017 at 21:08
If you need explanations of the individual commands, don't hesitate to ask.
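In case it helps, here is the extraction step again with each command annotated. The function name flc_pairs is only for illustration, and the sketch assumes GNU sed (for \n in the replacement) and a file where each row sits on a single line, as in data.htm.

```shell
#!/bin/sh
# The extraction step from the thread, wrapped in a hypothetical function (flc_pairs).
# Assumption: GNU sed, which understands \n in the replacement part of s///.
flc_pairs() {
    # 1. s/tr><tr/>\n</g : start a new line between rows
    #                      ("</tr><tr ..." becomes "</>", a newline, then "< ...")
    # 2. s/<[^>]*>/,/g   : replace every tag with a comma
    # 3. s/,,/,/g        : collapse the pairs of commas left by adjacent tags
    # 4. cut -d, -f6-7   : keep the 6th and 7th comma-separated fields,
    #                      i.e. the FLC code and the duration
    sed 's/tr><tr/>\n</g;s/<[^>]*>/,/g;s/,,/,/g' "$@" | cut -d, -f6-7
}
```

The second command in the thread adds `s/h /,/g`, which turns "01h 18m" into "01,18m" so that hours and minutes land in separate fields. awk then adds them up per code in t and t1 (the string "18m" counts as 18), and the END block carries whole hours out of the minute total.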