Problem :
I have a directory containing around 500k files, and want to slice them into t tar files.
Put formally, let's call the files file_0, ..., file_{N-1}, where N is around 500k. I want to create t tar files each containing T=N/t files, where the i-th tar file contains
file_(i*T), ..., file_((i+1)*T - 1), i in {0, ..., t-1}
What's an efficient way to do this? I was going to write a Python script that just loops over the N files and divides them into t folders, and then calls tar in each, but this feels very suboptimal. I have many cores on the server and feel like this should happen in parallel.
Solution :
You can use Python's concurrent.futures library, which hands a queue of jobs to a pool of worker threads and works through the queue until every job has finished.
- Generate a list of lists of files, like [[f0..fn-1], [fn..f2n-1], ...]
- Use a ThreadPoolExecutor to consume this list with as many threads as your machine can use. This can look like this:
import itertools
import math
import os
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def main(p, num_tar_files):
    files = list(split_files_in(p, num_tar_files))
    tar_up = tar_up_fn(p)
    # One worker thread per archive; each thread just waits on its tar process.
    with ThreadPoolExecutor(len(files)) as executor:
        archives = list(executor.map(tar_up, itertools.count(), files))
    print("\n{} archives generated".format(len(archives)))


def split_files_in(p, num_slices):
    files = sorted(os.listdir(p))
    N = len(files)
    T = int(math.ceil(N / num_slices))  # means last .tar might contain <T files
    for i in range(0, N, T):
        yield files[i:i+T]


def tar_up_fn(p):
    def tar_up(i, files):
        # normpath drops a trailing slash so dir_name is never empty
        _, dir_name = os.path.split(os.path.normpath(p))
        tar_file_name = "{}_{:05d}.tar".format(dir_name, i)
        print('Tarring {}'.format(tar_file_name))
        # Runs inside p, so the archive is written into p as well.
        subprocess.call(["tar", "-cf", tar_file_name] + files, cwd=p)
        return tar_file_name
    return tar_up


if __name__ == '__main__':
    main(sys.argv[1], int(sys.argv[2]))
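One thing to watch with the script above is that every file name in a slice is passed on the tar command line, which can run into the operating system's argument-length limit when slices are large. A possible variation, offered only as a sketch, writes the archives with Python's standard tarfile module instead of spawning tar. The names chunk, write_tar, tar_in_slices, the part_NNNNN.tar naming and the separate out_dir are all assumptions made for illustration, not part of the original answer.

```python
import math
import os
import tarfile
from concurrent.futures import ThreadPoolExecutor


def chunk(names, num_slices):
    """Split names into consecutive chunks of size ceil(N / num_slices)."""
    size = max(1, math.ceil(len(names) / num_slices))
    return [names[i:i + size] for i in range(0, len(names), size)]


def write_tar(src_dir, out_dir, index, names):
    """Write one uncompressed archive holding the given files from src_dir."""
    tar_path = os.path.join(out_dir, "part_{:05d}.tar".format(index))
    with tarfile.open(tar_path, "w") as tar:
        for name in names:
            # arcname keeps only the bare file name inside the archive
            tar.add(os.path.join(src_dir, name), arcname=name)
    return tar_path


def tar_in_slices(src_dir, out_dir, num_slices, max_workers=4):
    """Create up to num_slices archives in out_dir, one worker task per archive."""
    names = sorted(os.listdir(src_dir))
    jobs = list(enumerate(chunk(names, num_slices)))
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        return list(executor.map(
            lambda job: write_tar(src_dir, out_dir, job[0], job[1]), jobs))
```

Writing to a separate out_dir keeps the new archives out of the directory being listed. Since tarfile is pure Python, threads mostly help while waiting on disk I/O; whether this beats running several tar processes would need measuring on the actual machine.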
Using zsh to create lists for tar
Hope I understood what you're trying to do. t=731 was just a number I picked out of the air. Hack as needed. The following creates multiple list files, each holding t file names, except that the last one holds the remaining file names if the total is not a multiple of t.
Var=(*(.)) # glob files in current directory
VarSorted=(${(on)Var}) # numeric sort
fn=1 # Tar list file number
t=731 # Number of files in each tar file
for (( i = 1 ; i <= ${#VarSorted} ; i = i + t ))
do
    print -l -- ${VarSorted[$i,$i+$t-1]} > /tmp/tar_file_list_${(l:5::0:)fn}
    (( fn++ ))
done
Use the tar command's -T or --files-from (short/long form) option to generate each tar file. This too can be scripted.
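As a rough sketch of how that last step might be scripted, the following Python runs one tar process per list file, several at a time. It assumes the list files live at /tmp/tar_file_list_NNNNN as produced by the zsh loop, that the names inside them are relative to the source directory, and that tar accepts -T for reading names from a file. The function names and the archive_NNNNN.tar naming are hypothetical.

```python
import glob
import os
import subprocess
from concurrent.futures import ThreadPoolExecutor


def tar_command(list_file, archive_dir):
    """Build the tar invocation for one list file (archive name is an assumption)."""
    suffix = os.path.basename(list_file).rsplit("_", 1)[-1]
    archive = os.path.join(archive_dir, "archive_{}.tar".format(suffix))
    return ["tar", "-cf", archive, "-T", list_file]


def tar_from_lists(source_dir, archive_dir,
                   pattern="/tmp/tar_file_list_*", max_workers=4):
    """Run one tar per list file and return the exit codes in list-file order."""
    # Absolute path, because tar runs with source_dir as its working directory.
    archive_dir = os.path.abspath(archive_dir)
    commands = [tar_command(f, archive_dir) for f in sorted(glob.glob(pattern))]
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        return list(executor.map(
            lambda cmd: subprocess.call(cmd, cwd=source_dir), commands))
```

Calling tar_from_lists("/path/to/files", "/path/to/archives") would then return one exit code per list file, with 0 meaning that tar reported success. Both paths here are placeholders.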