ChatGPT code very slow on a large volume of files

Question

Hello everyone,

I'm working on a big project and I have to admit that it's ChatGPT that created my Python scripts.

I would like to know if you know of any website or people who understand the Python code from ChatGPT.

ChatGPT generated a lot of Python code for me (everything I need), the problem is that it's not fast at all, very very slow with hundreds of folders containing about 100,000 CSV files...

Do you know anyone or anything that can help me please?

Thank you in advance

mariam-j · Answer

Hello, Python is not fast (it is an interpreted language). For speed, use C, C++, or another compiled language.

PierrotLeFou · Answer

AI is not as intelligent as we think. I tested Google Gemini. Some results were extraordinary and others terrible.
It depends on what they have in their database.
The Python code generated by ChatGPT is no more (or less) difficult to understand than Python code generated by a human.
And what kind of performance do you expect with "hundreds" of files containing more than 100,000 items?
As @mariam-j said, you would need a compiled language like C, C++, or Rust (provided you know them).
And it depends on the complexity of the method used. Maybe a course in algorithms could help.

Lilie3887 · Answer

Thanks for the info PierrotLeFou, I'm going to look into what C, C++, and Rust are, as I've already heard of them. My external hard drive, where all the files are, is an HDD, so it's even slower.

But the code that ChatGPT suggests produces 100,000 files in 9 hours; I could have accepted 3 or 4 hours, but 9 hours is way too much.

In any case, thanks, I'm going to do some research.

mariam-j · Answer

It also depends on what you do in your files. 100,000 isn't that huge.

PierrotLeFou · Answer

I haven't analyzed your code but I have doubts about the relevance of threading.
All your files are on the same HDD and, as far as I know, the disk can only serve one access at a time.
I have previously worked on a multi-user system where the system managed disk accesses in a special way.
It knew where the read heads were located and tried to satisfy the request closest to the current position.
There was a priority system that prevented any request from getting stuck in the queue.
But in your case, on modern systems, we don't do that anymore.
We can assume (...) that the sectors associated with a given file are fairly close to each other.
I wonder if sequential processing (one file after another) wouldn't be faster.
Not knowing exactly what "hundreds" of files means, I assumed there were 500.
I assumed there were indeed 100,000 entries in each file.
And 9 hours gives 32,400 seconds.
I reach the conclusion that it takes 648 microseconds per entry. Even in Python we can do better.
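
The estimate above can be checked in a few lines. This is only a sketch of the arithmetic: the figures of 500 files and 100,000 entries per file are the assumptions stated above, not measured values.

```python
# Back-of-the-envelope estimate using the assumptions stated above.
n_files = 500                 # assumed meaning of "hundreds" of files
entries_per_file = 100_000    # assumed number of entries per file
total_seconds = 9 * 3600      # 9 hours = 32,400 seconds

total_entries = n_files * entries_per_file
microseconds_per_entry = total_seconds / total_entries * 1_000_000

print(f"{total_entries:,} entries, {microseconds_per_entry:.0f} microseconds per entry")
# prints: 50,000,000 entries, 648 microseconds per entry
```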

mamiemando · Answer

Hello,

To effectively address your issue (see #6), you should forget about C/C++/Rust, which won't gain you much, and instead focus on pandas.

Next, I think you won't learn anything with ChatGPT and that it's not the right approach. ChatGPT should be seen as an enhanced search engine. In my experience, ChatGPT answers fairly correctly on simple problems (meaning things you could have found with a Google search or on Stack Overflow) but not on more advanced issues. How many times have I seen ChatGPT tell me nonsense? And the funniest part is that when you point out the mistake to ChatGPT, it apologizes. It quickly goes in circles and doesn't let you move forward.

As Pierrot says, an AI is NOT intelligent (and I know a thing or two about it, I work in AI). Specialists would tell you that it is merely a stochastic parrot, which in everyday language means that it only provides an "average" of everything it has seen, based on what you ask it and the current context.

My advice if you want to progress and especially succeed would be:

- do your research on the right approaches to solve a problem, possibly with the help of ChatGPT,
- learn to use the technical solutions that seem relevant: there is plenty of documentation and many tutorials available online, in all formats,
- be critical of what you read (especially when it's something generated by an AI):
  - In your case, ChatGPT is steering you towards Python. Why? It is indeed the language commonly used by data scientists for data volumes that are not too astronomical (which is your case).
  - Then ChatGPT is guiding you towards polars. Why? To speed up the loading of CSV files.
    - Generally, people use pandas (and so can find many resources online showing how to use it).
    - I have never used polars, so I have no opinion; it might be good, but if I were to use it, I would first check that polars offers all the primitives I need.
    - According to this link, polars seems faster than pandas. That said, it's not the fastest, so why polars?
- always try to understand what you are writing (even if it means reading the documentation for each function called in your code).

Then specify your problem. Your initial message and #6 are too vague for us to know what you want to do.

- What is the structure of the CSV files?
- What are the parameters of your program?
  - We don't even know what we are supposed to pass as parameters to your program!
- What is the expected result? How do we obtain it?
  - Providing a minimal example would help us understand better.
- What is the context? Do you need to query this file set multiple times, with different queries?
  - If so, serializing the files (see pickle), or even the resulting dataframe, would probably be a good idea (you wouldn't have to pay the cost of parsing the CSV files every time you need to load one).
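
As a rough sketch of the serialization idea, assuming the same files are read several times: parse each CSV once, store the parsed rows with pickle, and reload the pickle on later runs. The helper name `load_rows`, the `.pkl` cache file placed next to the CSV, and the sample data are all hypothetical. Only the standard library is used here; the same idea applies to a dataframe.

```python
import csv
import pickle
import tempfile
from pathlib import Path


def load_rows(csv_path: Path) -> list[dict]:
    """Return the rows of csv_path, reusing a pickle cache when it is up to date.

    Hypothetical helper: the cache is stored next to the CSV as '<name>.csv.pkl'.
    """
    cache_path = csv_path.with_name(csv_path.name + ".pkl")
    # Reuse the cache only if it is at least as recent as the CSV itself.
    if cache_path.exists() and cache_path.stat().st_mtime >= csv_path.stat().st_mtime:
        with cache_path.open("rb") as f:
            return pickle.load(f)
    # Otherwise parse the CSV once and store the parsed rows for next time.
    with csv_path.open(newline="") as f:
        rows = list(csv.DictReader(f))
    with cache_path.open("wb") as f:
        pickle.dump(rows, f)
    return rows


# Small demonstration on a throwaway file (made-up data).
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.csv"
    sample.write_text("id,value\n1,10\n2,20\n")
    first = load_rows(sample)    # parses the CSV and writes the cache
    second = load_rows(sample)   # served from the pickle cache
    cache_created = (Path(tmp) / "sample.csv.pkl").exists()
```

Whether this actually saves time depends on the data, so it is worth measuring on a few real files first. Note also that pickle files should only be loaded when they come from a source you trust.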

Regarding Pierrot's remark #9 on threading, I think it is debatable. Parallelizing the loading of data files is probably of little benefit, but parallelizing their processing is worthwhile. If the loading time is relatively negligible, in my opinion, that doesn't pose any real problem. Furthermore, managing several concurrent accesses to a hard drive is the operating system's problem.

Then, the processing suggested by ChatGPT (lines 60 to 80) seems catastrophic to me. For dataframes, if we want efficient processing, we avoid loops as much as possible; otherwise, we pay the cost of "Python is an interpreted language" that was mentioned in #1. That's why we try as much as possible to use vectorized operations. I have never used polars (personally, I generally use pandas), but in pandas we would never loop over each cell to sum a column of a dataframe (see for example pd.DataFrame.sum).

So, on one side, it’s good to use polars to load CSV files faster, but if it processes the file this way afterward, it’s not surprising that it would be horrifically slow. The problem is that I personally didn't understand what you wanted to extract/aggregate/calculate, so I can't tell you what to look at.
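
To illustrate the difference between a cell-by-cell loop and a vectorized operation, here is a minimal sketch in pandas. The data and the column name `value` are made up, and this is not the code ChatGPT generated; polars has its own equivalents, which are not shown here.

```python
import pandas as pd

# Made-up data standing in for one loaded CSV file.
df = pd.DataFrame({"value": range(10_000)})

# Cell-by-cell loop: every iteration goes through the Python interpreter.
total_loop = 0
for i in range(len(df)):
    total_loop += df["value"].iloc[i]

# Vectorized: a single call, with the loop running in compiled code.
total_vectorized = df["value"].sum()

print(total_loop == total_vectorized)
```

Both versions give the same sum, but the loop pays the interpreter overhead on every cell, and that overhead adds up over millions of cells.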

Good luck

Anonymous user · Answer

Hello,

This question is starting to get old, but oh well.

The day AI correctly answers the question "what was the color of Henry IV's White Horse," I will start to trust it completely.

When you ask a question to an AI, make sure to phrase it correctly; an AI remains basic.

If you ask for code, specify the desired language.

If you don't know the basics of that language, start by asking for the basics.

If you want precise code, tell it to refactor the code to achieve a more concise result.

Don't forget that an AI can't see beyond the tip of its nose; it isn't capable of going beyond the third question.

An AI is still amazing in its production capability. I trust it with documenting my code, explicitly telling it not to modify the code itself.

In conclusion, an AI will provide you with all the answers you need, but you must not rush into it.

For each task, ask it which programming language is best suited to your problem, which parts of that language's vocabulary are relevant, and ask it to provide an outline of the tasks your program will need to accomplish.

AI should remain an aid to analysis, not just a pure code generator. The code an AI produces is ultimately unusable.
