Problem :
I have 2 old, similar directory trees with MP3 files in them. I have happily been using tools like diff and rsync to identify and merge the files that are only present on one side or are identical, but I'm left with a bunch of files that are bitwise different.
Running diff over a pair of actually different files (with the -a flag to force text analysis) produces incomprehensible gibberish. I have listened to files from both sides, and they both seem to play fine (but at nearly 10 minutes per song, listening to each one twice, I haven't done many).
I suspect the differences are due to some player in the past “enhancing” my collection by messing about with ID3 tags, but I can’t be certain. Even if I identify differences in ID3 tags, I would like to confirm that no cosmic ray or file copy error issues have damaged any of the files.
One method that occurs to me is finding the byte locations of the differences and ignoring all changes in the first ~10 kB of each file, but I don't know how to do this.
I have on the order of a hundred or so files that differ across the directory tree.
I found How to compare mp3, flac audio data in a file, ignoring header data (ID3 tag) etc.?, but I can't run AllDup since I'm on Linux, and from the sound of it, it would only partially solve my issues anyway.
Solution :
Beyond Compare, as suggested in that topic?
Beyond Compare 3 does not run as a console application on Linux. It requires X-Windows.

SUPPORTED LINUX DISTRIBUTIONS
Red Hat Enterprise Linux 4-6
Fedora 4-14
Novell Suse Linux Enterprise Desktop 10
openSUSE 10.3-11.2
Ubuntu 6.06-10.10
Debian 5.04
Mandriva 2010
Beyond Compare (referenced above) looks like a great solution. I've never used it. The bit about X-Windows just means that it wants to run in a GUI, not on the straight command line. If you have a GUI installed, the chances that X-Windows is already properly installed on your system are extremely good.
Some ideas on how to proceed:
cmp -i 10kB file1 file2
will compare two arbitrary files bytewise on Linux, first skipping 10 kB in each file. It even has an option for skipping different byte counts in each file. The -l option (optionally combined with -b) will print every differing byte, but that might be a very long output, so if you use it, pipe the output into a file or into less. You'd have to decide how many bytes to skip; I don't know that answer. To use it effectively for multiple files, you'd have to write a script in bash or another language, or run it as part of a find command with an exec option.
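For example, a rough sketch of that idea (untested; tree1, tree2, and the 10 kB skip are just placeholders) that walks one tree and runs cmp against the file at the same relative path in the other tree:

find tree1 -name '*.mp3' -print0 | while IFS= read -r -d '' f; do
    other="tree2/${f#tree1/}"
    [ -f "$other" ] || continue                  # only look at files present on both sides
    if ! cmp -s -i 10kB "$f" "$other"; then      # -s: report via exit status only
        printf 'DIFFERS past 10kB: %s\n' "$f"
    fi
done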
In the future, if looking for duplicate files, check out fdupes. It’s a utility designed just for that. I used it when I was still figuring out how to manage photos on my computer and ended up with a bunch of directories with lots of duplicates in them.
https://code.google.com/p/fdupes/
Also, if you look up fdupes on wikipedia, there’s a whole raft of Linux file compare programs listed in the entry.
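For instance, an invocation along these lines should list byte-identical duplicates across two trees (note that fdupes compares whole files, so it won't pair up files that differ only in their tags):

# recursively find byte-identical files across both directories
fdupes -r orig-dir dir-with-duplicates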
Just for the heck of it, I had a look at:
http://www.id3.org/id3v2.4.0-structure
which specifies the structure of ID3 tags. It "recommends" that the tags be placed at the start of the file, but also provides for additional tags to be added at the end of the file, so unless nobody uses that option, there may be meta information elsewhere in the file, not just at the beginning. A cursory look at the spec reveals that ID3 tag info is variable in length, so there is no exact byte count that is guaranteed to skip over it, but 10 kB as originally suggested ought to be more than enough to skip the initial tags in most cases (a tag with embedded album art can be larger).
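If you'd rather not guess at the skip count, the ID3v2 header itself says how long the tag is: it starts with "ID3", and bytes 7-10 hold the tag size as four 7-bit "syncsafe" bytes. Here is a rough shell sketch of my own that reads it (simplified: it ignores the optional footer and any tag appended at the end of the file):

id3v2_size() {                                   # print how many bytes the leading ID3v2 tag occupies
    local f=$1 bytes
    bytes=$( head -c 10 -- "$f" | od -An -tu1 )  # first 10 bytes as decimal numbers
    set -- $bytes
    if [ "${1:-0}" -eq 73 ] && [ "${2:-0}" -eq 68 ] && [ "${3:-0}" -eq 51 ]; then  # "I" "D" "3"
        echo $(( 10 + ($7 << 21) + ($8 << 14) + ($9 << 7) + ${10} ))              # header + syncsafe size
    else
        echo 0                                   # no ID3v2 tag at the start of the file
    fi
}

cmp -s -i "$( id3v2_size file1.mp3 ):$( id3v2_size file2.mp3 )" file1.mp3 file2.mp3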
As a possible solution you may use any tool to convert the file into an uncompressed stream (pcm, wav) without metadata info and then compare that. For the conversion you may use any software you have, like ffmpeg, sox or avidemux.
For example, here is how I do that with ffmpeg.
Say I have, for this example, 2 files with different metadata:
$ diff Original.mp3 Possible-dup.mp3 ; echo $?
Binary files Original.mp3 and Possible-dup.mp3 differ
A brute-force comparison complains that they differ.
Then we just convert and diff the body:
$ diff <( ffmpeg -loglevel 8 -i Original.mp3 -map_metadata -1 -f wav - ) <( ffmpeg -loglevel 8 -i Possible-dup.mp3 -map_metadata -1 -f wav - ) ; echo $?
0
Of course the ; echo $? part is just for demonstration purposes, to see the return code.
Processing multiple files (traverse directories)
If you want to check for duplicates in a collection, it is worth calculating checksums (anything like crc, md5, sha2, sha256) of the audio data and then just looking for collisions.
- First, calculate a hash of the audio data in each file (and write it into a file for further processing):
for file in *.mp3; do printf "%s:%s\n" "$( ffmpeg -loglevel 8 -i "$file" -map_metadata -1 -f wav - | sha256sum | cut -d' ' -f1 )" "$file"; done > mp3data.hashes
For your case you may just compare multiple directories, e.g.:
find -L orig-dir dir-with-duplicates -name '*.mp3' -print0 | while read -r -d $'\0' file; do printf "%s:%s\n" "$( ffmpeg -loglevel 8 -i "$file" -map_metadata -1 -f wav - | sha256sum | cut -d' ' -f1 )" "$file"; done > mp3data.hashes
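Then, to actually find the collisions, something like this will do (a minimal sketch, assuming the hash:path format written above and that no path contains a ':' character):

# prints "duplicate matches original" pairs, based only on the decoded audio data
awk -F: 'seen[$1] { print $2 " matches " seen[$1] } !seen[$1] { seen[$1] = $2 }' mp3data.hashes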