[awk] split a file into multiple files
SolvedDIE -
Hello everyone,
I'm going around in circles on a script that must be 100% awk, and I'm starting to wonder whether it's actually possible. I tried with ChatGPT but it gave me nonsense :-(
Here’s my problem:
- In the following file, I want to create a new file each time we encounter the word PAGE in the input file. The line containing the word PAGE is included at the beginning of the newly created file.
- If, somewhere after the word PAGE, we find the word NIR, then the word that follows NIR is stored in the NIR variable and used to name this file. The file is named after the value of the NIR variable, followed by the extension ".txt".
Example:
- Input file:
totot titi PAGE tata tata dfdf fdf NIR un deux troix quatre tata dfdfd dfdfdf dfdf PAGE dfd fdfdfd dfdf dfddfdfdf NIR one two three four five dfdf df PAGE dfdf NIR dfdf dfdfd
Expected result:
- The first file created is named un.txt and contains:
titi PAGE tata tata dfdf fdf NIR un deux troix quatre tata dfdfd
- The second file created is named one.txt and contains:
dfdfdf dfdf PAGE dfd fdfdfd dfdf dfddfdfdf NIR one two three four five dfdf
- The third file created is named dfdf.txt and contains:
df PAGE dfdf NIR dfdf dfdfd
Thank you in advance
4 answers
hello
try
$ awk 'BEGIN {while(getline < "file")if($0 ~ /NIR/)for(n=1; n<=NF; n++)if($n ~ /NIR/)t[++f]=$(n+1)} /PAGE/ {a=t[++x] ".txt"} a {print $0 > a }' file
$
$ more *txt
::::::::::::::
dfdf.txt
::::::::::::::
df PAGE dfdf NIR dfdf dfdfd
::::::::::::::
one.txt
::::::::::::::
dfdfdf dfdf PAGE dfd fdfdfd dfdf dfddfdfdf NIR one two three four five dfdf
::::::::::::::
un.txt
::::::::::::::
titi PAGE tata tata dfdf fdf NIR un deux trois quatre tata dfdfd
Hello,
I rewrote the initial message because some phrasing was ambiguous, and I’m not surprised that ChatGPT stumbled (I don’t actually believe ChatGPT can properly solve a non-trivial programming exercise, but anyway).
That said, the problem definition is incomplete:
- What happens (and should happen) if the keyword NIR does not appear after the word PAGE?
- Do you still create a new file?
- If so, what do you name it?
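To make these questions concrete, here is a rough sketch of one possible answer. It is not the approach used elsewhere in this thread: it reads the file in a single pass and buffers each page in memory until the next PAGE. The fallback name pageN.txt, the function flush_page, and the sample input are all assumptions made up for the illustration.

```shell
dir=$(mktemp -d)
cat > "$dir/input.txt" <<'EOF'
skipped
a PAGE b
NIR un x
c PAGE d
no name here
EOF

(cd "$dir" && awk '
# Write the buffered page to its file, then reset the buffer.
function flush_page(   name, i) {
    if (count == 0) return
    # Assumed fallback: pageN.txt when no NIR was seen on this page.
    name = (nir != "" ? nir : "page" page) ".txt"
    for (i = 1; i <= count; i++) print buf[i] > name
    close(name)
    count = 0; nir = ""
}
# A PAGE line ends the previous page and starts a new one.
/PAGE/ { flush_page(); page++; in_page = 1 }
in_page {
    buf[++count] = $0
    # Keep the first word that follows NIR on this page.
    for (i = 1; i < NF; i++)
        if (nir == "" && $i == "NIR") nir = $(i + 1)
}
END { flush_page() }
' input.txt)

ls "$dir"
```

With this sample, the first page should land in un.txt and the second, which has no NIR, in page2.txt. Whether a nameless page should be written at all is exactly the open question above.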
Good luck
it works SENSATIONALLY WELL :)
thank you so much
I don't understand how t[++f] works compared to t[++x] if you could enlighten me
These are just two different variables, but in both cases, they refer to the indices of the array t.
- Firstly, when the program starts (in the BEGIN block), you populate an array t, and each time you encounter NIR, you record the name of the future file at the current index (noted as f).
- Then, you process the file itself, and each time you encounter the word PAGE, you increment a counter (noted arbitrarily as x) that allows you to retrieve the file name from t.
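To see the first of these two steps in isolation, the sketch below runs only the BEGIN pass and prints the contents of t. The sample lines and the shell variables sample and names are made up for the illustration.

```shell
sample=$(mktemp)
printf '%s\n' 'a PAGE b' 'c NIR un d' 'e PAGE f' 'NIR one g' > "$sample"

names=$(awk '
BEGIN {
    # Pass 1 only: collect the word after each NIR, in order of appearance.
    while ((getline < ARGV[1]) > 0)
        for (n = 1; n <= NF; n++)
            if ($n ~ /NIR/) t[++f] = $(n + 1)
    # Show what the second pass will find when it looks up t[1], t[2], ...
    for (i = 1; i <= f; i++) print i, t[i]
    exit
}' "$sample")
printf '%s\n' "$names"
```

Each printed line pairs an index with a future file name, which is what t[++x] retrieves later, one PAGE at a time.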
I take this opportunity to correct a small mistake in proposal #1: the input file name is hard-coded as "file" in the BEGIN block, so the command only works if the input file is actually called that. Anyway, here's what mon_script.awk might look like:
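#!/usr/bin/awk -f

# Pass 1: collect, in order, the word that follows each NIR.
BEGIN {
    # "> 0" stops the loop at end of file, and also if the file cannot be read
    while ((getline < ARGV[1]) > 0) {
        if ($0 ~ /NIR/) {
            for (n = 1; n <= NF; n++) {
                if ($n ~ /NIR/) {
                    t[++f] = $(n + 1)
                }
            }
        }
    }
}

# Pass 2: each PAGE line selects the next name as the output file.
/PAGE/ {
    a = t[++x] ".txt"
}

# Once an output file is selected, copy every line into it.
a {
    print $0 > a
}

Some remarks along the way: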
- The first if ($0 ~ /NIR/) is redundant: it does not change the result. This test scans the whole line for NIR, and the for loop that follows then scans the same line again, field by field. This means that lines containing NIR are scanned twice and lines not containing NIR are scanned once. If we remove this test, each line is only scanned once, by the for loop.
- The /PAGE/ block sets a variable a, which holds the name of the file we now write to. As long as a is non-empty (that is, once the first PAGE has been seen), the 3rd block is executed for each line read and writes it to the file a. Lines that come before the first PAGE are therefore not written anywhere.
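A minimal sketch of the script with the redundant test removed, as the first remark suggests. The sample input and its line breaks are an assumption (the original layout of the input file is not known), and the temporary directory is only there to keep the example self-contained. The "> 0" check on getline and the close() call are additions, not part of the original proposal.

```shell
dir=$(mktemp -d)
cat > "$dir/input.txt" <<'EOF'
totot
titi PAGE tata
tata dfdf fdf
NIR un deux troix quatre
tata dfdfd
dfdfdf dfdf PAGE dfd
fdfdfd dfdf dfddfdfdf
NIR one two three four five
dfdf
df PAGE dfdf
NIR dfdf dfdfd
EOF

(cd "$dir" && awk '
BEGIN {
    # Pass 1: no test on $0, the field loop alone finds every NIR.
    while ((getline < ARGV[1]) > 0)
        for (n = 1; n <= NF; n++)
            if ($n ~ /NIR/) t[++f] = $(n + 1)
    close(ARGV[1])
}
# Pass 2: a PAGE line switches to the next output file.
/PAGE/ { a = t[++x] ".txt" }
# Lines before the first PAGE are skipped because a is still empty.
a { print $0 > a }
' input.txt)

ls "$dir"
```

With this assumed layout, the run should leave un.txt, one.txt and dfdf.txt next to input.txt, each starting with its PAGE line.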
Once the mon_script.awk file is written, to run it:
awk -f mon_script.awk fichier.txt
Alternatively, you can give execution rights to the script, and thanks to the first line of the script, your shell will know it should use awk -f:
chmod a+x mon_script.awk
./mon_script.awk fichier.txt
Good luck
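In awk, arrays are associative, so an index can be any value; by convention, numbering starts at 1 (not at 0 like in C, for example). A variable that has never been assigned is equal to 0 in a numeric context.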
At the first occurrence of NIR, f is 0, so t[++f] refers to t[1]; at the next one, t[2]; and so on.
Then the file is read a second time.
At the first occurrence of PAGE, x is 0, so t[++x] refers to t[1], and so on.
This time we read back from the array t the name of the file to be created.
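A tiny self-contained check of this behaviour, with no input file involved. The shell variable out is only there to capture the result.

```shell
out=$(awk 'BEGIN {
    # f and x were never assigned: each counts as 0 in a numeric context.
    t[++f] = "un"     # f becomes 1 first, so this fills t[1]
    t[++f] = "one"    # f becomes 2, so this fills t[2]
    print t[++x]      # x becomes 1: prints un
    print t[++x]      # x becomes 2: prints one
    print f, x        # both counters ended at 2
}')
printf '%s\n' "$out"
```

The two counters are independent: f advances while the names are stored, x advances while they are read back, and both walk the same indices in the same order.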