QUESTION :
I need to extract the second field of selected lines in a GEDCOM file. These lines are all of the following format:
% grep @ /tmp/XYZ | tail -5
0 @X701@ OBJE
0 @X702@ OBJE
0 @X750@ OBJE
0 @X765@ OBJE
0 @X766@ OBJE
But in the following,
% egrep "0 @[^@]@" /tmp/XYZ
% perl -CSD -p -i -e 's:0 @([^@])@ .*:ZYX \1:g;' /tmp/XYZ
the first finds nothing and the second changes nothing;
I don’t understand why.
The -CSD is there because, although the file is mostly ASCII, it contains some French, Polish, and Chinese, and is encoded in UTF-8.
As far as I am aware, @ is not a special character in regular expressions.
Update: I am looking for the field that has the function of a primary key. It is always delimited by @ and therefore cannot contain an @. Some lines might reference such a key, but it is only primary when the line starts with 0. I must not match lines that contain other @, but that should be ensured by putting in a string-begin ^. I must also not hit on lines of other formats; I used grep to show the format of the target lines, and tail to limit the size to less than five thousand.
ANSWER :
- If you might have lines that look like
60 @FOO@ blah
or
42.0 @记鬼四七@ quux
(and you don’t want to match them), you should begin your regex with a ^; e.g., ^0 @….
- [^@] matches exactly one character, such as X or 7, so the original pattern finds nothing when the key is longer than one character. To match any number of non-@ characters (e.g., X701) between the two @ characters, you need [^@]* or [^@]+; e.g.,
% egrep '^0 @[^@]*@' /tmp/XYZ
% perl -CSD -p -i -e 's:^0 @([^@]*)@ .*:ZYX \1:g;' /tmp/XYZ
Use + if you must have at least one non-@ character between the two @ characters. Don’t use \@ unless plain @ fails.
- To avoid matching lines that have a third @, use another [^@]* to specify that the rest of the line consists of characters other than @:
% egrep '^0 @[^@]*@ [^@]*$' /tmp/XYZ