Compare two CSV files with Python Solved

Question

Hello, I’m trying to compare 2 CSV files to extract the similarities into another file, but at the moment the program outputs the information from the 2 files. Here is the program:

import csv
with open('Recherche.csv', 'r', encoding='utf-8') as t1, open('TravailSFE.csv', 'r', encoding='utf-8') as t2:
    fileone = t1.readlines()
    filetwo = t2.readlines()
with open('update.csv', 'w', encoding='utf-8') as outFile:
    for line in filetwo:
        if line in fileone:
            outFile.write(line)

To explain, the Recherche file contains the company addresses in the first column and the company SIRENs in the second. The Travail file contains only the addresses, so I would like the update file to return the addresses that match, in order to extract the SIRENs (I’m not sure if that’s clear enough ;)). If you can help me, I’d appreciate it.

Bruno83200_6929 · Answer

Hello,

If I understand correctly, you have two CSV files: one contains company addresses with their SIREN, and the other contains only the addresses. You want to create an output file with the common addresses and the corresponding SIREN.

If that’s right, you should first check that the CSV files are properly formatted and that the addresses in both files are in the same format to avoid errors during comparison.

Next, you need to adapt your program to read the CSV files using the csv module to handle the data in a structured way.

Create a dictionary from the first CSV file to store the addresses and SIREN.

Compare the addresses from the second file with those in the dictionary.

Write the common addresses and their SIREN to the output file.

A small program like this:

import csv

# Read Recherche.csv and build a dictionary mapping addresses to SIREN
adresses_siren = {}
with open('Recherche.csv', 'r', encoding='utf-8') as recherche_file:
    reader = csv.reader(recherche_file)
    next(reader)  # Skips the header line; remove this if your file has none
    for row in reader:
        adresse = row[0]
        siren = row[1]
        adresses_siren[adresse] = siren

# Read TravailSFE.csv and compare its addresses with those in the dictionary
with open('TravailSFE.csv', 'r', encoding='utf-8') as travail_file, \
     open('update.csv', 'w', encoding='utf-8', newline='') as update_file:
    reader = csv.reader(travail_file)
    writer = csv.writer(update_file)

    # Write the header to the output file if needed
    writer.writerow(['Adresse', 'SIREN'])

    # This loop must stay inside the with block, while both files are still open
    for row in reader:
        adresse = row[0]
        if adresse in adresses_siren:
            writer.writerow([adresse, adresses_siren[adresse]])

print("The comparison is finished. The results have been written to 'update.csv'.")

csv.reader is used to read the files line by line.
The header is skipped with next(reader) if your file has one.

The adresses_siren dictionary maps each address to its SIREN, which makes lookups fast.

Each address in TravailSFE.csv is checked against the dictionary.
If it is present, the address and the corresponding SIREN are written to update.csv.
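To illustrate why a lookup keyed on the address can succeed where the whole-line test in the question cannot, here is a minimal sketch on in-memory data. The sample rows are invented for illustration and are not taken from the poster's files.

```python
import csv
import io

# Hypothetical sample data standing in for Recherche.csv and TravailSFE.csv.
recherche_text = "adresse,siren\n1 rue de Paris,111111111\n2 avenue Foch,222222222\n"
travail_text = "adresse\n2 avenue Foch\n3 place Bellecour\n"

# Whole-line comparison: "2 avenue Foch\n" never equals "2 avenue Foch,222222222\n".
recherche_lines = recherche_text.splitlines(keepends=True)
travail_lines = travail_text.splitlines(keepends=True)
whole_line_matches = [line for line in travail_lines if line in recherche_lines]

# Key-based comparison: index the first file by address, then look each address up.
reader = csv.reader(io.StringIO(recherche_text))
next(reader)  # skip the header row
adresses_siren = {row[0]: row[1] for row in reader if len(row) >= 2}

reader = csv.reader(io.StringIO(travail_text))
next(reader)  # skip the header row
key_matches = [(row[0], adresses_siren[row[0]])
               for row in reader if row and row[0] in adresses_siren]

print(whole_line_matches)  # no line of the second file equals a full line of the first
print(key_matches)         # the shared address, paired with its SIREN
```

When one file has an extra column, comparing whole lines cannot match anything; comparing only the address column can.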

Bruno83200_6929 · Answer

IndexError: list index out of range means that the program is trying to access an element in a list (here row[1] for the SIREN), but the line in question does not contain enough elements (columns) to access that index.

This can happen for several reasons:

Some lines in your CSV file do not contain two columns.
There may be empty lines.

There may be formatting issues in the file (such as newline characters or incorrect delimiters).

Check that each line has exactly two columns in the file Recherche.csv.

Add error checks to handle empty or badly formatted lines.

I have kept your code. I will come back with a version that adds an extra check to ensure that each line contains at least two columns before accessing the indices.
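Before adding checks, it can help to see how many rows have an unexpected number of columns. The sketch below tallies rows by column count; the helper name column_counts and the sample data are hypothetical, and the comma delimiter is an assumption about the file.

```python
import csv
import io
from collections import Counter

def column_counts(csv_file, delimiter=","):
    """Return a Counter mapping 'number of columns' to 'number of rows'."""
    return Counter(len(row) for row in csv.reader(csv_file, delimiter=delimiter))

# Hypothetical sample: a header, one good row, one empty line, one row missing the SIREN.
sample = io.StringIO("adresse,siren\n1 rue de Paris,111111111\n\n2 avenue Foch\n")
counts = column_counts(sample)
print(counts)
```

Any key other than 2 in the result points to rows that would raise IndexError on row[1]. With a real file, pass the object returned by open('Recherche.csv', 'r', encoding='utf-8', newline='') instead of the StringIO sample.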

Bruno83200_6929 · Answer

Well, I added the extra check in the code.

import csv

# Read the file Recherche.csv and create a dictionary for addresses and SIREN
adresses_siren = {}
with open('Recherche.csv', 'r', encoding='utf-8') as recherche_file:
    reader = csv.reader(recherche_file)
    next(reader)  # If your file has a header line, otherwise remove this line
    for row in reader:
        # Check that the line has at least two columns (address and SIREN)
        if len(row) < 2:
            print(f"Row ignored (missing column): {row}")
            continue  # Move to the next row if the line does not contain 2 columns
        adresse = row[0]
        siren = row[1]
        adresses_siren[adresse] = siren

# Read the file TravailSFE.csv and compare the addresses with those in the dictionary
with open('TravailSFE.csv', 'r', encoding='utf-8') as travail_file, \
     open('update.csv', 'w', encoding='utf-8', newline='') as update_file:
    reader = csv.reader(travail_file)
    writer = csv.writer(update_file)

    # Write the header to the output file if necessary
    writer.writerow(['Adresse', 'SIREN'])

    for row in reader:
        if len(row) == 0:  # Check if the row is empty
            print(f"Empty row ignored: {row}")
            continue
        adresse = row[0]
        if adresse in adresses_siren:
            writer.writerow([adresse, adresses_siren[adresse]])

print("The comparison is finished. The results have been written to 'update.csv'.")

Lisana_69 · Answer

import csv

adresse_siren = {}
with open('BDS.csv', 'r', encoding='utf-8') as recherche_file:
    reader = csv.reader(recherche_file)
    next(reader)
    for row in reader:
        adresse = row[0]
        siren = row[1]
        adresse_siren[adresse] = siren

with open('BDT.csv', 'r', encoding='utf-8') as travail_file, \
     open('update.csv', 'w', encoding='utf-8', newline='') as update_file:
    reader = csv.reader(travail_file)
    writer = csv.writer(update_file)
    writer.writerow(['adresse', 'siren'])
    for row in reader:
        adresse = row[0]
        if adresse in adresse_siren:
            writer.writerow([adresse, adresse_siren[adresse]])

print("The comparison is finished. The results have been written to 'update.csv'.")

The first image is the BDS CSV (formerly Recherche) and the second is the BDT CSV (Travail SFE).

I have also sent you the program again; perhaps it will help you identify my error more easily.

Lisana_69 · Answer

I just tried the new program: it displays the ignored lines in the Python console, but the update file is empty.

Bruno83200_6929 · Answer

If the update.csv file is empty, it means that the program did not find any matches between the addresses in the BDT.csv file and those in the BDS.csv file. This can be due to several reasons, namely:

Inconsistency in the address format (e.g., extra spaces, uppercase/lowercase letters, accents).

Case sensitivity issue (uppercase/lowercase): the addresses might be written differently in the two files, making comparison impossible.

Subtle differences in addresses (such as commas or different abbreviations, for example, "Rue" instead of "R.").

I will add a normalize_string() function to the code to convert addresses to lowercase and remove accents (é, è, à, etc.) and extra spaces. This will make the addresses in the two files comparable even if they differ slightly in case or accents.

If an address from the BDT.csv file does not match any address in BDS.csv, it will be displayed in the console. This will help you identify any potential differences.
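Normalizing case and accents does not help with abbreviations such as "R." for "Rue". One possible option for those cases is approximate matching with difflib from the standard library. This is only a sketch: the helper name closest_address, the 0.8 cutoff, and the sample addresses are assumptions chosen for illustration.

```python
import difflib

def closest_address(adresse, known_addresses, cutoff=0.8):
    """Return the most similar known address, or None if none reaches the cutoff."""
    matches = difflib.get_close_matches(adresse, known_addresses, n=1, cutoff=cutoff)
    return matches[0] if matches else None

# Hypothetical, already-normalized addresses.
known = ["12 rue de la paix", "5 avenue foch"]
print(closest_address("12 r. de la paix", known))      # close enough to the first entry
print(closest_address("99 boulevard inconnu", known))  # nothing similar: None
```

A lower cutoff accepts looser matches, so approximate results are worth reviewing by hand before trusting the associated SIREN.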

I will modify the code and send it to you. We will get there, so don’t despair; as far as I can tell, it’s only a formatting issue.

Lisana_69 · Answer

The update file remains empty even after modifying the program. I don't understand why, because I have checked that the addresses of the BDS file match those of the BDT.

Bruno83200_6929 · Answer

OK

The message you obtained shows that the address in the file contains unexpected characters. The ASCII codes [32, 59, 59, 59, 59] represent the following characters:

32: a space ( ),
59: a semicolon (;).

This indicates that some lines in your CSV files contain unexpected or badly formatted characters, such as consecutive semicolons (;;;;). These characters may result from incorrect formatting in the original file or mishandling of the CSV file.
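A short sketch of how such codes can be obtained, and of what happens when a semicolon-separated line is read with the default comma delimiter. The helper name char_codes and the sample line are hypothetical.

```python
import csv
import io

def char_codes(text):
    """Return the code point of each character, to expose invisible or stray ones."""
    return [ord(c) for c in text]

suspect = " ;;;;"            # a space followed by four semicolons
print(repr(suspect))         # repr() makes leading and trailing whitespace visible
print(char_codes(suspect))   # [32, 59, 59, 59, 59]

# A semicolon-separated line read with the default comma delimiter
# comes back as a single column that still contains the semicolon.
line = "1 rue de Paris;111111111\n"
row_comma = next(csv.reader(io.StringIO(line)))
row_semicolon = next(csv.reader(io.StringIO(line), delimiter=";"))
print(row_comma)
print(row_semicolon)
```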

I will prepare a new script for you and send it to you later.

Lisana_69 · Answer

Okay, I understand. No problem, see you later, and thank you very much.

mamiemando · Answer

Hello,

So that everyone can test the proposed programs, would it be possible to share the CSV files in question?

Have you considered pandas? Besides making it easy to load CSV files (see pd.read_csv), pandas provides numerous very efficient primitives for data manipulation. If I understand correctly, the goal here is to join the two files on the siren column (if that's the case, you can use DataFrame.join or pd.merge). To export a dataframe, use the to_csv method.

Example:

fichier1.csv

nom,prenom,siren
solo,han,1111
skywalker,luke,1111
the hutt,jabba,0
vador,dark,333

fichier2.csv

siren,cause
1111,rebellion
333,empire

toto.py

#!/usr/bin/env python3
import pandas as pd

df1 = pd.read_csv("fichier1.csv")
print(df1)
print("-" * 50)
df2 = pd.read_csv("fichier2.csv")
print(df2)
print("-" * 50)
# Join the two dataframes on the siren column
df = df1.set_index("siren").join(df2.set_index("siren"))
print(df)
print("-" * 50)
print(df.to_csv())

Result:

         nom prenom  siren
0       solo    han   1111
1  skywalker   luke   1111
2   the hutt  jabba      0
3      vador   dark    333
--------------------------------------------------
   siren      cause
0   1111  rebellion
1    333     empire
--------------------------------------------------
             nom prenom      cause
siren
1111        solo    han  rebellion
1111   skywalker   luke  rebellion
0       the hutt  jabba        NaN
333        vador   dark     empire
--------------------------------------------------
siren,nom,prenom,cause
1111,solo,han,rebellion
1111,skywalker,luke,rebellion
0,the hutt,jabba,
333,vador,dark,empire
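Applied to the files described in the question, which share an address column rather than a siren column, the same idea might look like the sketch below. The column names adresse and siren, the semicolon separator, the helper column cle, and the sample rows are all assumptions, since the real files were not shared.

```python
import io
import pandas as pd

# Hypothetical data; sep=";" and dtype=str are assumptions about the real files.
# dtype=str keeps the SIREN as text, so a leading zero is not lost.
bds = pd.read_csv(
    io.StringIO("adresse;siren\n1 Rue de Paris;012345678\n2 avenue Foch;222222222\n"),
    sep=";", dtype=str)
bdt = pd.read_csv(
    io.StringIO("adresse\n1 rue de paris \n3 place Bellecour\n"),
    sep=";", dtype=str)

# Build a normalized key so that case and surrounding spaces do not block the match.
for df in (bds, bdt):
    df["cle"] = df["adresse"].str.strip().str.lower()

# An inner merge keeps only the addresses present in both files.
update = bdt.merge(bds[["cle", "siren"]], on="cle", how="inner")[["adresse", "siren"]]
print(update.to_csv(index=False, sep=";"))
```

With real files, the io.StringIO(...) arguments would be replaced by the file paths, and update.to_csv("update.csv", index=False, sep=";") would write the result to disk.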

Good luck

Lisana_69 · Answer

Thank you for your response, but unfortunately I cannot share the files because they are business contacts. I will try to look into it with pandas, though. Thank you very much.

Bruno83200_6929 · Answer

Semicolons could be present instead of the expected commas, which means the CSV files are either not correctly structured or are being read with the wrong separator.

Here is the code modified to use a semicolon as the separator, in case that is what your files use:

import csv
import unicodedata

# Function to normalize addresses (lowercase, remove accents)
def normalize_string(s):
    s = s.strip().lower()  # Remove surrounding spaces and convert to lowercase
    s = unicodedata.normalize('NFKD', s).encode('ASCII', 'ignore').decode('ASCII')  # Remove accents
    return s

adresse_siren = {}

# Read the BDS.csv file with semicolon separator
with open('BDS.csv', 'r', encoding='utf-8') as recherche_file:
    reader = csv.reader(recherche_file, delimiter=';')  # Specify the separator
    next(reader)  # Skip header row if present
    for row in reader:
        if len(row) < 2:  # Check that the line contains at least 2 columns (address, siren)
            print(f"Row ignored (missing or badly formatted column): {row}")
            continue  # Skip incorrect rows
        adresse = normalize_string(row[0])  # Normalize the address
        siren = row[1].strip()  # Remove extra spaces for the siren
        print(f"Added to dictionary: {row[0]} -> {siren} (normalized: {adresse})")
        adresse_siren[adresse] = siren

# Read the BDT.csv file with semicolon separator and compare addresses
with open('BDT.csv', 'r', encoding='utf-8') as travail_file, \
     open('update.csv', 'w', encoding='utf-8', newline='') as update_file:
    reader = csv.reader(travail_file, delimiter=';')  # Specify the separator
    writer = csv.writer(update_file)
    writer.writerow(['adresse', 'siren'])  # Write header in the output file
    for row in reader:
        if len(row) == 0:  # Check if the line is empty
            print(f"Empty row ignored: {row}")
            continue
        adresse = normalize_string(row[0])  # Normalize the address in BDT.csv
        print(f"Comparing: {row[0]} (normalized: {adresse})")
        if adresse in adresse_siren:
            print(f"Matching address found: {row[0]} -> {adresse_siren[adresse]}")
            writer.writerow([row[0], adresse_siren[adresse]])  # Use the original address in the output file
        else:
            print(f"Address not found: {row[0]} (normalized: {adresse})")

print("The comparison is finished. The results have been written to 'update.csv'.")

If your files use the semicolon as the separator (which seems to be the case given the ;;;; characters), this lets the program read the columns of the files correctly.

If this solution does not work, it may be useful to check the CSV files manually to make sure the columns are properly separated by commas or semicolons.
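The separator can also be checked from Python instead of by eye. The sketch below is a crude heuristic that counts candidate separators in the header line; the helper name guess_delimiter and the candidate list are assumptions. The standard library's csv.Sniffer class offers a more thorough detection.

```python
def guess_delimiter(first_line, candidates=(",", ";", "\t")):
    """Return the candidate that occurs most often in the first line (crude heuristic)."""
    return max(candidates, key=first_line.count)

# Hypothetical header lines.
print(repr(guess_delimiter("adresse;siren")))
print(repr(guess_delimiter("nom,prenom,siren")))
```

With a real file, the first line could be read with open('BDS.csv', encoding='utf-8').readline() and the result passed as the delimiter argument of csv.reader. If no candidate appears in the line, the function falls back to the first candidate (the comma).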

If your files are not well structured, try re-exporting them with a tool such as Excel or a text editor to make sure they follow a correct CSV format (with clear separators).

Handling CSV files is always delicate, and errors are often related to file formatting. It is always wiser to use tools such as Excel or LibreOffice Calc (which is free and open source), both popular spreadsheets for working with CSV files. If you can't manage it with Python, I recommend using one of these spreadsheets.

You can copy and paste the addresses from one file into a new sheet, then use formulas such as RECHERCHEV (VLOOKUP in English-language Excel) to associate the SIRENs with the matching addresses.

Example of a RECHERCHEV formula:

=RECHERCHEV(A2;BDS!A:B;2;FAUX)

I don't know what other solution to offer you.

Lisana_69 · Answer

With the NIC code as well
