Hello everyone.
How do you scan a text file in the following way?
It will have multiple occurrences of a label in successive records. For instance:
****newrecord****
*TTTT lorum ipsem
*TTTT feugiat, ullamcorper
*RRRR suscipit dolor.....
*RRRR fuerat inscipior
*SSS nunquam alebit
****endofrecord****
This is how it actually is: the labels are all in caps preceded by *, the start and end of each record are marked as shown, and each line consists of a field label followed by field content. I am trying to turn this into a CSV file for import into another database, but am having problems because the number of identical field labels varies from record to record. The original database appears to have allowed the user to create any number of identically labelled fields, so the number of fields in a record varies from 30 to 50, and there can be 5-10 duplicate field labels in any record. The file is several thousand records long.
The problem is to go through the file and remove only the second and subsequent occurrences of a label, appending their content to the line carrying the first occurrence. So in the above example, applying the process yields:
*TTTT lorum ipsem feugiat, ullamcorper
*RRRR suscipit dolor fuerat inscipior
*SSS nunquam alebit
Once the file is like this, it will be simple to go through and change the carriage returns into tabs, and then do the import.
I realise that uniq will find duplicates, that tr will do replacements, and that you can probably pipe one into the other... or use GAWK? But I can't seem to figure out how to make them do exactly this. My main problem is how to delete only the second and subsequent duplicate labels, instead of either all occurrences of the label, or the whole record containing the duplicate label. Any ideas?
Regards & thanks in advance to anyone patient enough to help.
Peter Berrie
On 25-Sep-05 Peter wrote:
Hello everyone.
How do you scan a text file in the following way?
It will have multiple occurrences of a label in successive records. For instance:
****newrecord****
*TTTT lorum ipsem
*TTTT feugiat, ullamcorper
*RRRR suscipit dolor.....
*RRRR fuerat inscipior
*SSS nunquam alebit
****endofrecord****
The following is a skeleton 'awk' script which basically does what you want, though I'm not sure what happened (or was supposed to happen) to the "....." in your stated desired output, so I've just left them in. This could be changed if desired.
Awk script (in file "temp0.awk" which has 755 permissions):
#! /bin/bash
awk '
/\*\*\*\*newrecord\*\*\*\*/ {
    print $0
    next
}
/\*\*\*\*endofrecord\*\*\*\*/ {
    for ( i in item ) {
        print i item[i]
    }
    delete item          # reset so content does not leak into the next record
    print $0
    next
}
{ label = $1 }
{ $1 = "" }
{ item[label] = item[label] $0 }
'

[Note: the "*" characters in the two patterns must be escaped with "\", since "*" is a regular-expression operator in 'awk'.]
[Note the opening and closing quotes '...' in the above script]
Copy of a session using your data above:
./temp0.awk << EOT
****newrecord****
*TTTT lorum ipsem
*TTTT feugiat, ullamcorper
*RRRR suscipit dolor.....
*RRRR fuerat inscipior
*SSS nunquam alebit
****endofrecord****
EOT
****newrecord****
*RRRR suscipit dolor..... fuerat inscipior
*SSS nunquam alebit
*TTTT lorum ipsem feugiat, ullamcorper
****endofrecord****
Note that the labels come out in a different order from the order they went in. This is a consequence of using 'awk' arrays to store the data, with the labels as index values. I think the fact that they appear sorted alphabetically is a coincidence: 'awk' may access the index values of an array in arbitrary order. Doing it with arrays ensures that each label occurs only once, and the content stored under a given label is built up from the chunks in the order in which they occur.
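If the original order of the labels matters for your import, here is a variant sketch (my own embellishment of the skeleton above, not tested against your real data; the `merge_records` wrapper name is just for illustration). It records the position at which each label is first seen and prints in that order at end-of-record, also resetting state between records:

```shell
# Order-preserving variant (sketch): remember the sequence in which
# labels first appear, then print in that sequence at end-of-record.
merge_records() {
awk '
/^\*\*\*\*newrecord\*\*\*\*$/ { print; next }
/^\*\*\*\*endofrecord\*\*\*\*$/ {
    for (j = 1; j <= n; j++) print order[j] item[order[j]]
    for (j = 1; j <= n; j++) delete item[order[j]]   # reset for next record
    n = 0
    print
    next
}
{
    label = $1
    if (!(label in item)) order[++n] = label   # first sighting of this label
    $1 = ""
    item[label] = item[label] $0               # append this content chunk
}
'
}

merge_records << "EOT"
****newrecord****
*TTTT lorum ipsem
*TTTT feugiat, ullamcorper
*RRRR suscipit dolor
****endofrecord****
EOT
# Prints:
# ****newrecord****
# *TTTT lorum ipsem feugiat, ullamcorper
# *RRRR suscipit dolor
# ****endofrecord****
```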
The purpose of the "next" statements is to ensure that the lines marking the beginning and end of the record do not participate in the construction of the array.
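As for the later step you mention, turning each merged record into a single tab-separated line, one possible sketch (again only an illustration; the `to_tabs` name is invented) drops the marker lines and joins the remaining fields with tabs:

```shell
# Sketch of the follow-on step: collapse each merged record into one
# tab-separated line, discarding the record markers themselves.
to_tabs() {
awk '
/^\*\*\*\*newrecord\*\*\*\*$/   { line = ""; next }
/^\*\*\*\*endofrecord\*\*\*\*$/ { print line; next }
{ line = (line == "" ? $0 : line "\t" $0) }   # join fields with a tab
'
}

to_tabs << "EOT"
****newrecord****
*TTTT lorum ipsem feugiat, ullamcorper
*RRRR suscipit dolor
****endofrecord****
EOT
```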
Feel free to come back for any further suggestions about refining this or similar solutions.
Best wishes, Ted.
--------------------------------------------------------------------
E-Mail: (Ted Harding) Ted.Harding@nessie.mcc.ac.uk
Fax-to-email: +44 (0)870 094 0861
Date: 25-Sep-05  Time: 18:02:01
------------------------------ XFMail ------------------------------