ChatGPT code very slow on a large number of files

Lilie3887 Posted messages 19 Status Membre -  
 Anonymous user -

Hello everyone,

I'm working on a big project and I have to admit that it's ChatGPT that created my Python scripts.

I would like to know if you know of any website or people who understand the Python code from ChatGPT.

ChatGPT generated a lot of Python code for me (everything I need), the problem is that it's not fast at all, very very slow with hundreds of folders containing about 100,000 CSV files...

Do you know anyone or anything that can help me please?

Thank you in advance.

7 replies

PierrotLeFou
 

AI is not as intelligent as we think. I tested Google Gemini. Some results were extraordinary and others terrible.
It depends on what they have in their database.
The Python code generated by ChatGPT is no more (or less) difficult to understand than Python code generated by a human.
And what kind of performance do you expect with "hundreds" of files containing more than 100,000 items?
As @mariam-j said, you would need a compiled language like C, C++, or Rust (provided you know them).
And it depends on the complexity of the method used. Maybe a course in algorithms could help.

0
Lilie3887 Posted messages 19 Status Membre
 

Thanks for the info PierrotLeFou, I'm going to look into what C, C++, and Rust are, as I've already heard of them. My external hard drive, where all the files are, is an HDD, so it's even slower.

But the code that ChatGPT suggests produces 100,000 files in 9 hours; I could have accepted 3 or 4 hours, but 9 hours is way too much.

In any case, thanks, I'm going to do some research.

0
PierrotLeFou
 

I haven't analyzed your code but I have doubts about the relevance of threading.
All your files are on the same HDD and, as far as I know, a hard disk can only serve one access at a time.
I have previously worked on a multi-user system where the system managed disk accesses in a special way.
It knew where the read heads were located and tried to satisfy the request closest to the current position.
There was a priority system that prevented any request from getting stuck in the queue.
But in your case, on modern systems, we don't do that anymore.
We can assume (...) that the sectors associated with a given file are fairly close to each other.
I wonder if sequential processing (one file after another) wouldn't be faster.
Not knowing exactly what "hundreds" of files means, I assumed there were 500.
I assumed there were indeed 100,000 entries in each file.
And 9 hours gives 32,400 seconds.
I reach the conclusion that it takes 648 microseconds per entry. Even in Python we can do better.
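
The estimate above can be checked in a few lines. The 500 files and 100,000 entries per file are the guesses stated above, not measured values, so the result is only an order of magnitude.

```python
# Assumptions taken from the post above (guesses, not measurements):
# 500 files, 100,000 entries per file, 9 hours of total run time.
n_files = 500
entries_per_file = 100_000
total_seconds = 9 * 3600

total_entries = n_files * entries_per_file
microseconds_per_entry = total_seconds / total_entries * 1_000_000

print(f"{total_seconds} s for {total_entries:,} entries")
print(f"{microseconds_per_entry:.0f} microseconds per entry")
```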

0
Lilie3887 Posted messages 19 Status Membre
 

What can I ask the AI chat so that it improves my Python code? Processing 100,000 files takes me 38,521 seconds, almost 11 hours, so I'm feeling down. I don't know what to do anymore. Now I'm asking ChatGPT to create the same code in Rust. I've been at it since 10 am this morning, it has made me try at least 20 versions of the code, and all of them have errors. Not a single working version since this morning...

0
PierrotLeFou > Lilie3887 Posted messages 19 Status Membre
 

I noticed that AIs, especially Google Gemini for me, are rather bad with Rust.
As I said, AIs are not really intelligent.
They go into their database to find code that matches the request.
Since Rust is relatively new, there is probably little code in this language.
I assume you are not very familiar with programming.
In principle, you would be advised to write your code yourself.
You can ask it not to use threading for the reason I mentioned.
AIs do a good semantic analysis.
You can try to explain your setup in your own words. For example, that all your files are on the same HDD and you don't have multiple parallel accesses.
Is there really no code in Rust that works?
Have you tried in C or C++? (I think you would have better luck in C++)

0
Lilie3887 Posted messages 19 Status Membre > PierrotLeFou
 

I'm going to try C++, but do you know of an AI that is really good at C++, please?

0
PierrotLeFou > Lilie3887 Posted messages 19 Status Membre
 

You want to have your problems solved by an AI. It's at your own risk.
I already mentioned "Google Gemini"
gemini.google.com

0
Lilie3887 Posted messages 19 Status Membre > PierrotLeFou
 

Thank you for giving me the info about Google Gemini.
Unfortunately, I only know about AI, but if you know anyone who could help me with my problem, I would appreciate it.

0
mamiemando Posted messages 33537 Registration date   Status Modérateur Last intervention   7 927
 

Hello,

To effectively address your issue (see #6), you should forget about C/C++/Rust which won't bring you much, and instead focus on pandas.

Next, I think you won't learn anything with ChatGPT and that it's not the right approach. ChatGPT should be seen as an enhanced search engine. In my experience, ChatGPT responds fairly correctly on simple problems (understood as things you could have found with a Google search or on Stack Overflow) but not on more advanced issues. How many times have I seen ChatGPT telling me nonsense? And the funniest part is that when you point out the mistake to ChatGPT, it does its mea culpa. Quickly it goes in circles and doesn't let you move forward.

As Pierrot says, an AI is NOT intelligent (and I know a thing or two about it, I work in AI). Specialists would tell you that it is merely a stochastic parrot, which in everyday language means that it only provides an "average" of everything it has seen, based on what you ask it and the current context.

My advice if you want to progress and especially succeed would be:

  • do your research on the right approaches to solve a problem, possibly with the help of ChatGPT,
  • learn to use the technical solutions that seem relevant: there is plenty of documentation and many tutorials available online, in all formats,
  • be critical of what you read (especially when it's something generated by an AI)
    • In your case, ChatGPT is steering you towards Python. Why? It is indeed the language commonly used by data scientists for data volumes that are not too astronomical (which is your case).
    • Then ChatGPT is guiding you towards polars. Why? To accelerate the loading of CSV files.
      • Generally, people use pandas (and can therefore find many resources online showing how to use it).
      • I have never used polars, so I have no opinion, it might be good but if I were to use it, I would first check that polars indeed offers all the primitives I need.
      • According to this link, polars seems more performant than pandas. That said, it's not the fastest, so why polars?
  • always try to understand what you are writing (even if it means reading the documentation for each function called in your code).

Then specify your problem. Your initial message and #6 are too vague for us to know what you want to do.

  • What is the structure of the CSV files?
  • What are the parameters of your program?
    • We don't even know what we are supposed to pass as parameters to your program!
  • What is the expected result? How do we obtain it?
    • Providing a minimal example would help to better understand.
  • What is the context? Do you need to query this file set multiple times, with different queries?
    • If so, serializing the files (see pickle), or even the resulting dataframe, would probably be a good idea (you wouldn't have to pay the cost of parsing the CSV files every time you need to load one).
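
To illustrate the serialization idea, here is a minimal sketch using only the standard library. The helper name `load_rows`, the `;` separator and the `.pkl` cache naming are assumptions made for the example; whether the cache actually saves time depends on the files and should be measured.

```python
import csv
import pickle
from pathlib import Path


def load_rows(csv_path, cache_dir):
    """Return the rows of a ';'-separated CSV, reusing a pickle cache when it is fresh.

    Hypothetical helper: cache_dir must already exist.
    """
    csv_path = Path(csv_path)
    cache_path = Path(cache_dir) / (csv_path.name + ".pkl")

    # Reuse the cache only if it is at least as recent as the CSV file.
    if cache_path.exists() and cache_path.stat().st_mtime >= csv_path.stat().st_mtime:
        with cache_path.open("rb") as f:
            return pickle.load(f)

    # Otherwise parse the CSV once and store the parsed rows for next time.
    with csv_path.open(newline="", encoding="utf-8") as f:
        rows = list(csv.reader(f, delimiter=";"))
    with cache_path.open("wb") as f:
        pickle.dump(rows, f, protocol=pickle.HIGHEST_PROTOCOL)
    return rows
```

With pandas, `DataFrame.to_pickle` and `pandas.read_pickle` play the same role for a whole dataframe. Only unpickle files you produced yourself, since loading a pickle can execute arbitrary code.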

Regarding Pierrot's remark #9 on threading, I think it is debatable. Parallelizing the loading of data files probably doesn't have much interest, but processing them does. If the loading time is relatively negligible, in my opinion, that doesn't pose any real problem. Furthermore, managing several concurrent accesses to a hard drive is the operating system's problem.

Then, the processing suggested by ChatGPT (lines 60 to 80) seems catastrophic to me. With dataframes, if we want efficient processing, we avoid loops as much as possible; otherwise, we pay the cost of "Python is an interpreted language" mentioned in #1. That's why we use vectorized operations wherever possible. I have never used polars (personally, I generally use pandas), but in pandas we would never loop over each cell to sum a column of a dataframe (see for example pandas.DataFrame.sum).
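
As a small illustration of the difference, here is a sketch in pandas (not polars, which the script uses). The column names and the `lookup` dictionary are made up for the example, and it assumes pandas is installed.

```python
import pandas as pd

df = pd.DataFrame({"amount": [1.5, 2.0, 3.5, 4.0]})

# Slow pattern: a Python-level loop over every cell.
total_loop = 0.0
for value in df["amount"]:
    total_loop += value

# Vectorized pattern: a single call, the loop runs in compiled code.
total_vectorized = df["amount"].sum()

# The same idea for a dictionary lookup followed by a per-row sum.
lookup = {"A": 10.0, "B": 2.5}  # hypothetical key -> value table
keys = pd.DataFrame({"k1": ["A", "B", "Z"], "k2": ["B", "B", "A"]})

# Map each key column through the dictionary, count unknown keys as 0,
# then sum across the columns of each row.
row_sums = keys.apply(lambda col: col.map(lookup)).fillna(0).sum(axis=1)

print(total_loop, total_vectorized)
print(row_sums.tolist())
```

On four values the two sums cost the same; the gap only shows up on large columns, so it is worth timing both on real data.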

So, on one side, it’s good to use polars to load CSV files faster, but if it processes the file this way afterward, it’s not surprising that it would be horrifically slow. The problem is that I personally didn't understand what you wanted to extract/aggregate/calculate, so I can't tell you what to look at.

Good luck

0
PierrotLeFou
 

@mamiemando
I asked Google Gemini to solve a problem for which I already knew the solution.
It gave me a very ineffective answer. I submitted my own, which was much better.
I pointed out the situation to it.
Surprisingly, it understood why my solution was better than its own.
I started a new session by asking it the same question. It gave me its old solution again, not the one for which it supposedly understood the benefit.

0
mamiemando Posted messages 33537 Registration date   Status Modérateur Last intervention   7 927 > PierrotLeFou
 

If you're trying to tell me that when it comes to development, we shouldn't count on AI to do anything other than some groundwork, I already knew that :-)

0
PierrotLeFou > mamiemando Posted messages 33537 Registration date   Status Modérateur Last intervention  
 

Nevertheless, I find it easier to find my documentation about Rust on Google Gemini than the official Rust documentation.

0
mamiemando Posted messages 33537 Registration date   Status Modérateur Last intervention   7 927 > PierrotLeFou
 

For me, it's off-topic, but I'll respond briefly. Finding a library, module, class, or function to meet a need is a rather "simple" question. In fact, we can often manage without AI (a traditional Google search is often sufficient to find a clear resource like Stack Overflow). One of the many issues with AI is, as you mentioned, that it can quickly go in circles on a slightly more complicated problem and/or provide something incorrect, so it's a relatively reliable crutch, but not completely reliable. And I'm not even talking about environmental aspects.

0
Anonymous user
 

Hello,

This question is starting to get old, but oh well.

The day AI correctly answers the question "what was the color of Henry IV's White Horse," I will start to trust it completely.

When you ask a question to an AI, make sure to phrase it correctly; an AI remains basic.

If you ask for code, specify the desired language.

If you don't know the basics of that language, start by asking for the basics.

If you want concise code, tell it to refactor the code to get a more compact result.

Don't forget that an AI can't see beyond the tip of its nose; it isn't capable of going beyond the third question.

An AI is still amazing in its production capability. I trust it with documenting my code, clearly telling it not to modify the code itself.

In conclusion, an AI will provide you with all the answers you need, but you must not rush into it.

For each task, ask it which programming language is best suited to address your problem, which vocabulary base of the chosen language is relevant, and to provide you with a task outline that your program will need to accomplish.

AI should remain an aid to analysis, not just a pure code generator. The code an AI produces is ultimately unusable.

0
mariam-j Posted messages 44 Registration date   Status Membre Last intervention   39
 

Hello,

Python is not fast (interpreted language)

For speed: C, C++, and other compiled languages.


-1
Lilie3887 Posted messages 19 Status Membre
 

Thank you very much, mariam-j. I will look into C and C++. Thank you for your help.

0
mamiemando Posted messages 33537 Registration date   Status Modérateur Last intervention   7 927 > Lilie3887 Posted messages 19 Status Membre
 
  • This is bad advice and it's also half wrong, as it forgets that in Python you can call compiled code (typically C/C++). This is why libraries like pandas or NumPy allow Python to have comparable performance for processing such files.
  • If you get into C/C++, you will face other difficulties: learning a new language, new concepts that you don't worry about in Python, needing to install an environment to compile C/C++ code, and many more.
0
mariam-j Posted messages 44 Registration date   Status Membre Last intervention   39
 

It also depends on what you do in your files.

100,000 isn't that huge.


-1
Lilie3887 Posted messages 19 Status Membre
 

I am doing a VLOOKUP, then writing the sum of the lookup results in the column associated with that VLOOKUP, adding the date of the VLOOKUP while shifting the data already in the file to the right, and deleting every column from column 41 onwards.

import os
import polars as pl
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

DOSSIIR_CSV = r"D:\PYTHON\VALEUR REMPLACER ZIP"
FICHIER_REBASE = r"D:\PYTHON\REBASE.csv"


def get_dates_and_dicts():
    df = pl.read_csv(FICHIER_REBASE, separator=";", encoding="utf8-lossy", has_header=False)
    dicts = [
        {
            str(k).strip().replace('"', '').replace("'", ''): str(v).strip().replace('"', '').replace("'", '')
            for k, v in zip(df[:, i].to_list(), df[:, i+1].to_list())
        }
        for i in range(0, 19, 3)
    ]

    def get_date(col):
        return str(df[1, col]) if df[1, col] is not None else (str(df[0, col]) if df[0, col] is not None else "Date Not Found")

    dates = [get_date(i) for i in range(0, 19, 3)]
    return (*dicts, *dates)


def make_unique_columns(columns):
    seen = {}
    unique_columns = []
    for name in columns:
        if name in seen:
            seen[name] += 1
            new_name = f"{name}_{seen[name]}"
        else:
            seen[name] = 0
            new_name = name
        unique_columns.append(new_name)
    return unique_columns


def traiter_fichier(fichier, *args):
    dicts = args[:7]
    dates = args[7:]
    chemin = os.path.join(DOSSIIR_CSV, fichier)
    try:
        if os.path.getsize(chemin) == 0:
            return f"❌ {fichier} error: empty file"
        with open(chemin, 'r', encoding='utf8', errors='ignore') as f:
            first_line = f.readline()
        nb_columns = len(first_line.strip().split(";"))
        unique_names = [f"column_{i}" for i in range(nb_columns)]
        df = pl.read_csv(chemin, separator=";", has_header=False, new_columns=unique_names, encoding="utf8-lossy").fill_null("")
        n_rows = df.height
        if df.width < 85:
            df = df.hstack([pl.Series([""] * df.height).alias(f"column_{i}") for i in range(df.width, 85)])
        rows = df.to_numpy()

        def calculer_somme(dico, row):
            total = 0
            for j in range(20):
                key = str(row[j]).strip().replace('"', '').replace("'", '')
                val = dico.get(key, "")
                try:
                    total += float(val) if val.replace(".", "", 1).isdigit() else 0
                except:
                    total += 0
            return str(int(total))

        for col_index, dico in zip(range(20, 27), dicts):
            somme = [calculer_somme(dico, row) for row in rows]
            for i in range(n_rows):
                if str(rows[i][col_index]).strip():
                    for j in range(len(rows[i]) - 1, col_index, -1):
                        rows[i][j] = rows[i][j - 1]
                rows[i][col_index] = somme[i]

        final_columns = make_unique_columns([f"column_{i}" for i in range(len(rows[0]))])
        df = pl.DataFrame({
            col: ["" if val is None else str(val) for val in rows[:, idx]]
            for idx, col in enumerate(final_columns)
        })
        date_cols = [pl.Series(final_columns[i], [date] + df[1:, i].to_list()) for i, date in zip(range(20, 27), dates)]
        df = df.with_columns(date_cols)
        df = df[:, 0:41]
        df.write_csv(chemin, separator=";", include_header=False, quote_style="never")
        return f"✅ {fichier} OK"
    except Exception as e:
        return f"❌ {fichier} error: {e}"


if __name__ == "__main__":
    start = time.time()
    args = get_dates_and_dicts()
    total = 0
    with ThreadPoolExecutor(max_workers=os.cpu_count()) as executor:
        futures = [
            executor.submit(traiter_fichier, f.name, *args)
            for f in os.scandir(DOSSIIR_CSV)
            if f.name.endswith(".csv") and f.is_file()
        ]
        for f in as_completed(futures):
            print(f.result())
            total += 1
    print(f"\n✅ {total} files processed successfully!")
    print(f"⏱️ Total execution time: {time.time() - start:.2f} seconds")
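
As a rough illustration of the operation described above, here is a simplified sketch for a single row and a single lookup table, using only the standard library. The helper names `to_float_lookup` and `update_row` are hypothetical. The column positions (20 key columns, sum written at index 20, 41 columns kept) are read from the script, and converting the lookup values to floats once, rather than for every cell, is a suggested change rather than what the script does.

```python
def to_float_lookup(raw_lookup):
    """Convert lookup values to floats once, instead of re-parsing them per cell."""
    converted = {}
    for key, value in raw_lookup.items():
        try:
            converted[key] = float(value)
        except ValueError:
            converted[key] = 0.0  # non-numeric values count as 0
    return converted


def update_row(row, lookup, sum_col=20, n_key_cols=20, max_cols=41):
    """Write the lookup sum at sum_col, shifting existing data right, then truncate."""
    # Sum the looked-up values of the key columns; unknown keys count as 0.
    total = sum(lookup.get(cell.strip(), 0.0) for cell in row[:n_key_cols])
    value = str(int(total))

    # Pad short rows so that index sum_col exists.
    row = list(row) + [""] * max(0, sum_col + 1 - len(row))
    if row[sum_col].strip():
        row.insert(sum_col, value)  # existing data moves one column right
    else:
        row[sum_col] = value        # empty cell: just fill it in
    return row[:max_cols]           # drop everything from column max_cols on


lookup = to_float_lookup({"a": "3", "b": "4.5", "c": "n/a"})
example = update_row(["a", "b"] + [""] * 18 + ["old", "x"], lookup)
print(example[20:23])
```

This leaves out the date row and the seven separate lookup tables, and it accepts negative or exponent-style numbers that the script's `isdigit` check would reject, so the output should be compared with the script's on a few real files before relying on it.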
0
[Dal] Posted messages 6205 Registration date   Status Contributeur Last intervention   1 108 > Lilie3887 Posted messages 19 Status Membre
 

Hello Lilie3887,

I asked Grok3 (the AI from X) to analyze this code, then I told it, "This script has issues with execution time being too slow. What solutions can you propose?". This resulted in the following exchange:

https://x.com/i/grok/share/ylqcVgTBKX10zUe7KuNv4lPpW

(See the link for the details of the optimization proposals, some of which were mentioned by mamiemando, and for which Grok3 provides code or implementation illustrations.)

Grok3 concludes:

Estimated gains

  • Vectorization: Can reduce sum computation time by 50 to 90% depending on data size.

  • I/O Optimization: 10 to 30% reduction in read/write time.

  • Parallelism with ProcessPoolExecutor: Up to 2x faster if calculations dominate.

  • Parquet (if applicable): Up to 5x faster for read/write on large files.

Test each optimization individually with a subset of files and profile to confirm the gains. If you have details on file size or the hardware used, I can refine these recommendations!
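
One possible way to follow that advice, sketched with the standard library only. `time_on_sample` and `estimate_total_seconds` are hypothetical helper names, and the linear extrapolation assumes that all files cost roughly the same, which may not hold on a fragmented HDD.

```python
import time


def time_on_sample(process, items, sample_size=100):
    """Run process() on the first sample_size items; return (elapsed_seconds, results)."""
    sample = list(items)[:sample_size]
    start = time.perf_counter()
    results = [process(item) for item in sample]
    elapsed = time.perf_counter() - start
    return elapsed, results


def estimate_total_seconds(elapsed, sample_size, total_items):
    """Linear extrapolation from the sample to the whole data set."""
    return elapsed / sample_size * total_items


# Example with a stand-in for the real per-file function.
elapsed, results = time_on_sample(lambda n: n * 2, range(1000), sample_size=10)
print(f"{elapsed:.6f} s for {len(results)} items")
print(f"estimated total: {estimate_total_seconds(elapsed, 10, 100_000):.2f} s")
```

For a breakdown of where the time goes inside the per-file function, the standard `cProfile` module can be run on the same sample.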

Grok3 is not particularly known as an AI for programming (Gemini, Claude Sonnet, and ChatGPT have a better reputation for that), but that's what I had on hand, and it provides avenues to explore.

In fact, it's a blind optimization because without a data set and a simplified example of the starting data and the result you want to achieve, we can't really understand what you want to do or if such solutions are appropriate. You also haven't answered the various questions posed by mamiemando, which are questions a professional would ask (the answers to which probably contain information you haven't provided to your AI either).

Programming is a profession, or at least, it can be learned. A programmer solves problems with an appropriate algorithm and code. For that, they must understand the problem. A programmer can then use an AI as another tool to obtain a simple code they can assess for relevance against the problem they have already understood. If you don't know how to program in Python (or programming in general), nor how to express your problem, then using a tool that guesses the nature of your request and provides responses it deems most statistically correct based on its training data—and which can, even with a clearly stated problem, give you false, incomplete, approximate, or irrelevant answers (given the current state of technology)—is problematic.

The fact that the code is not optimized is a lesser evil if it does what you really want it to do... I assume you've checked that.

If you haven't done so, I highly recommend that you do.

Dal

0