Hello everyone.
How do you scan a text file in the following way?
It will have multiple occurrences of a label in successive records. For instance:
****newrecord****
*TTTT lorum ipsem
*TTTT feugiat, ullamcorper
*RRRR suscipit dolor.....
*RRRR fuerat inscipior
*SSS nunquam alebit
****endofrecord****
This is how it actually is: the labels are all in caps preceded by *, the start and end of each record are marked as shown, and each line consists of a field label followed by field content. I am trying to turn this into a CSV file for import into another database, but am having problems because the number of identical field labels varies from record to record. The original database appears to have allowed the user to create any number of identically labelled fields, so the number of fields in a record varies from 30 to 50, and there can be 5-10 duplicate field labels in any record. The file is several thousand records long.
The problem is to go through the file and remove only the second and subsequent occurrences of a label, appending their content to the line carrying the first occurrence. So in the above example, applying the process yields:
*TTTT lorum ipsem feugiat, ullamcorper
*RRRR suscipit dolor fuerat inscipior
*SSS nunquam alebit
Once the file is like this, it will be simple to go through and change the carriage returns into tabs, and then do the import.
I realise that uniq will find duplicates, that tr will do replacements, and that you can probably pipe one into the other... or use GAWK? But I can't seem to figure out how to make them do exactly this. My main problem is how to delete only the second and subsequent duplicate labels, instead of either all occurrences of the label, or the whole record containing the duplicate label. Any ideas?
Regards & thanks in advance to anyone patient enough to help.
Peter Berrie
On 25-Sep-05 Peter wrote:
Hello everyone.
How do you scan a text file in the following way?
It will have multiple occurrences of a label in successive records. For instance:
****newrecord****
*TTTT lorum ipsem
*TTTT feugiat, ullamcorper
*RRRR suscipit dolor.....
*RRRR fuerat inscipior
*SSS nunquam alebit
****endofrecord****
The following is a skeleton 'awk' script which basically does what you want, though I'm not sure what happened (or was supposed to happen) to the "....." in your stated desired output, so I've just left them in. This could be changed if desired.
Awk script (in file "temp0.awk" which has 755 permissions):
#! /bin/bash
awk '
/\*\*\*\*newrecord\*\*\*\*/ {
    print $0
    next
}
/\*\*\*\*endofrecord\*\*\*\*/ {
    for ( i in item ) {
        print i item[i]
    }
    delete item          # reset so content does not leak into the next record
    print $0
    next
}
{ label = $1 }
{ $1 = "" }
{ item[label] = item[label] $0 }
'

[Note: the "*" characters in the two patterns must be escaped with "\", since "*" is a regular-expression operator in 'awk'.]
[Note the opening and closing quotes '...' in the above script]
Copy of a session using your data above:
./temp0.awk << EOT
****newrecord****
*TTTT lorum ipsem
*TTTT feugiat, ullamcorper
*RRRR suscipit dolor.....
*RRRR fuerat inscipior
*SSS nunquam alebit
****endofrecord****
EOT
****newrecord****
*RRRR suscipit dolor..... fuerat inscipior
*SSS nunquam alebit
*TTTT lorum ipsem feugiat, ullamcorper
****endofrecord****
Note that the labels come out in a different order from the order they went in. This is a consequence of using 'awk' arrays to store the data, with the labels as index values. I think the fact that they appear sorted alphabetically is a coincidence: 'awk' may access the index values of an array in arbitrary order. Doing it with arrays ensures that each label occurs only once, and the content stored under a given label is built up from the chunks in the order in which they occur.
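If the original order of the labels matters for your import, here is a variant sketch (my own embellishment of the skeleton above, not tested against your real data; the `merge_records` wrapper name is just for illustration). It records the position at which each label is first seen and prints in that order at end-of-record, also resetting state between records:

```shell
# Order-preserving variant (sketch): remember the sequence in which
# labels first appear, then print in that sequence at end-of-record.
merge_records() {
awk '
/^\*\*\*\*newrecord\*\*\*\*$/ { print; next }
/^\*\*\*\*endofrecord\*\*\*\*$/ {
    for (j = 1; j <= n; j++) print order[j] item[order[j]]
    for (j = 1; j <= n; j++) delete item[order[j]]   # reset for next record
    n = 0
    print
    next
}
{
    label = $1
    if (!(label in item)) order[++n] = label   # first sighting of this label
    $1 = ""
    item[label] = item[label] $0               # append this content chunk
}
'
}

merge_records << "EOT"
****newrecord****
*TTTT lorum ipsem
*TTTT feugiat, ullamcorper
*RRRR suscipit dolor
****endofrecord****
EOT
# Prints:
# ****newrecord****
# *TTTT lorum ipsem feugiat, ullamcorper
# *RRRR suscipit dolor
# ****endofrecord****
```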
The purpose of the "next" statements is to ensure that the lines marking the beginning and end of the record do not participate in the construction of the array.
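As for the later step you mention, turning each merged record into a single tab-separated line, one possible sketch (again only an illustration; the `to_tabs` name is invented) drops the marker lines and joins the remaining fields with tabs:

```shell
# Sketch of the follow-on step: collapse each merged record into one
# tab-separated line, discarding the record markers themselves.
to_tabs() {
awk '
/^\*\*\*\*newrecord\*\*\*\*$/   { line = ""; next }
/^\*\*\*\*endofrecord\*\*\*\*$/ { print line; next }
{ line = (line == "" ? $0 : line "\t" $0) }   # join fields with a tab
'
}

to_tabs << "EOT"
****newrecord****
*TTTT lorum ipsem feugiat, ullamcorper
*RRRR suscipit dolor
****endofrecord****
EOT
```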
Feel free to come back for any further suggestions about refining this or similar solutions.
Best wishes, Ted.
--------------------------------------------------------------------
E-Mail: (Ted Harding) Ted.Harding@nessie.mcc.ac.uk
Fax-to-email: +44 (0)870 094 0861
Date: 25-Sep-05  Time: 18:02:01
------------------------------ XFMail ------------------------------