Problem :
How can I merge log files, i.e. files that are sorted by time but also contain multi-line entries, where only the first line of each entry has a timestamp and the remaining lines do not?
log1
01:02:03.6497,2224,0022 foo
foo1
2foo
foo3
01:04:03.6497,2224,0022 bar
1bar
bar2
3bar
log2
01:03:03.6497,2224,0022 FOO
FOO1
2FOO
FOO3
Expected result
01:02:03.6497,2224,0022 foo
foo1
2foo
foo3
01:03:03.6497,2224,0022 FOO
FOO1
2FOO
FOO3
01:04:03.6497,2224,0022 bar
1bar
bar2
3bar
If it weren’t for the non-timestamp lines starting with a digit, a simple sort -nm log1 log2 would do.
Is there an easy way on a Unix/Linux command line to get the job done?
Edit: As these log files are often in the gigabytes, merging should be done without re-sorting the (already sorted) log files, and without loading the files completely into memory.
Solution :
Tricky. While it is possible using date and bash arrays, this really is the kind of thing that would benefit from a real programming language. In Perl for example:
$ perl -ne '$d=$1 if /(.+?),/; $k{$d}.=$_; END{print $k{$_} for sort keys(%k);}' log*
01:02:03.6497,2224,0022 foo
foo1
2foo
foo3
01:03:03.6497,2224,0022 FOO
FOO1
2FOO
FOO3
01:04:03.6497,2224,0022 bar
1bar
bar2
3bar
Here’s the same thing uncondensed into a commented script:
#!/usr/bin/env perl
## Read each input line, saving it
## as $_. This while loop is equivalent
## to perl -ne
while (<>) {
    ## If this line has a comma
    if (/(.+?),/) {
        ## Save everything up to the 1st
        ## comma as $date
        $date=$1;
    }
    ## Add the current line to the %k hash.
    ## The hash's keys are the dates and the
    ## contents are the lines.
    $k{$date}.=$_;
}
## Get the sorted list of hash keys
@dates=sort(keys(%k));
## Now that we have them sorted,
## print each set of lines.
foreach $date (@dates) {
    print "$k{$date}";
}
Note that this assumes that all timestamp lines, and only the timestamp lines, contain a comma. If that’s not the case, you can match the timestamp format explicitly instead:
perl -ne '$d=$1 if /^(\d+:\d+:\d+\.\d+),/; $k{$d}.=$_; END{print $k{$_} for sort keys(%k);}' log*
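Before relying on the stricter pattern, it can help to check which lines it would treat as the start of a record. Here is a minimal sketch using grep -E, with the same pattern written with [0-9] in place of Perl's \d; the function name record_starts is made up for illustration:

```shell
#!/bin/sh
# record_starts FILE: list, with line numbers, the lines that begin with
# something shaped like HH:MM:SS.ffff followed by a comma, i.e. the lines
# that would start a new record.
# (Hypothetical helper, not part of the original answer.)
record_starts() {
    grep -nE '^[0-9]+:[0-9]+:[0-9]+\.[0-9]+,' "$1"
}
```

Run against log1 from the question, this should list only lines 1 and 5: continuation lines such as 2foo and 1bar start with a digit but do not match the pattern.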
The approach above needs to keep the entire contents of the files in memory. If that is a problem, here’s one that doesn’t:
$ perl -pe 's/\n/\0/; s/^/\n/ if /^\d+:\d+:\d+\.\d+/' log* |
sort -n | perl -lne 's/\0/\n/g; printf "%s", $_'
01:02:03.6497,2224,0022 foo
foo1
2foo
foo3
01:03:03.6497,2224,0022 FOO
FOO1
2FOO
FOO3
01:04:03.6497,2224,0022 bar
1bar
bar2
3bar
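Because each input file is already sorted, the same flatten-then-sort idea can be combined with sort -m, which only merges its inputs and does not re-sort them. The following is a sketch under stated assumptions rather than a drop-in replacement: it assumes the byte 0x1F (ASCII unit separator) never occurs in the logs, that timestamps are fixed-width so they compare correctly as plain strings, and that writing flattened copies to a temporary directory is acceptable. The names flatten and merge_logs are made up for illustration.

```shell
#!/bin/sh
# flatten FILE: put each multi-line record on a single line, joining its
# lines with 0x1F (assumed never to appear in the logs).
flatten() {
    awk '
    {
        if (NR == 1)                                      sep = ""      # very first line
        else if ($0 ~ /^[0-9]+:[0-9]+:[0-9]+[.][0-9]+,/)  sep = "\n"    # new record
        else                                              sep = "\037"  # continuation line
        printf "%s%s", sep, $0
    }
    END { if (NR > 0) printf "\n" }
    ' "$1"
}

# merge_logs FILE...: merge already-sorted logs by their timestamp field.
merge_logs() {
    tmpdir=$(mktemp -d) || return 1
    n=0
    for f in "$@"; do
        n=$((n + 1))
        flatten "$f" > "$tmpdir/$n"
    done
    # -m merges pre-sorted inputs; the key is everything before the first comma.
    LC_ALL=C sort -m -t, -k1,1 "$tmpdir"/* | tr '\037' '\n'
    rm -rf "$tmpdir"
}
```

Called as merge_logs log1 log2, this should reproduce the expected result from the question. Both awk and sort -m stream their input, so memory use stays small; the cost is one flattened copy of each log on disk.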
The Perl pipeline above simply puts all lines between successive timestamps onto a single line by replacing newlines with \0