Yes, I know
How to catenate files scattered through several sub-directories

Yesterday, I needed to catenate all the CSV files from some dataset. The trouble was they were scattered into many different directories on several levels. My initial reflex was to use the find command—​but I then wondered if there were other ways of doing that?


The dataset

I can’t distribute the data set I was working on. But here are few commands to approximate its layout, that way you will have a better idea of the problem:

for i in {a..e}; do
  mkdir -p dataset/$i
  echo "$i: I don't want that" > dataset/$i/head.txt
  echo "$i,I,want,that" > dataset/$i/head.csv
  for j in {1..5}; do
    mkdir dataset/$i/$j
    echo "$i$j,I,want,that,too" > dataset/$i/$j/data.csv

Using the find command

I like the find command. I really like it: it can solve so many different problems! Here, it was pretty straightforward to use:

find dataset/ -name '*.csv' -exec cat {} \;

Or better, if your find command supports that alternate syntax:

find dataset/ -name '*.csv' -exec cat {} +

The difference is, in the second form, only one instance of the cat command is created, with all filenames passed as arguments of the same instance. Whereas with the former syntax, the find command will spawn a new cat instance for each found file.

Using the grep command

I am probably not the first one to discover that, but I must admit I’m particularly proud of having thought to this one:

grep -lr --include '*.csv' '' dataset/

You may already know the -r option of grep to recursively search into directories. And if you are using GNU grep, you might be aware of the --include option that limits the search space to the files matching a given glob pattern. The trick here is to use the empty '' search pattern. It will match all lines of all files of the search space. So grep will display all lines of each file matching the *.csv glob pattern. Exactly what I wanted.

Note: the -l option is used to remove the filename from the default grep output since I’m only interested in the file content.

Using the Bash extended glob patterns

When the globstar option is enabled, the Bash supports the ** glob pattern that will match zero or more directories. This is not a Bash-specific feature since Zsh and the Korn Shell (ksh) (at least) have a similar feature. But I’m less familiar with them, so I will stick with the Bash syntax:

(shopt -s globstar; cat dataset/**/*.csv)

Here I use the parenthesis to fork a new sub-shell that will execute the commands give inside the parenthesis. That way, I can alter the shell options in the child process without interfering with the settings of the parent shell. Speaking of that, this is the role of the shopt Bash internal command to set the shell options, in that case, to activate the globstar feature. Now, I can use an extended glob pattern to pass as arguments to the cat command all .csv files, including those buried in subdirectories (*).

And this is the end of that pretty short article. I hope you learned a trick or two. If you’re interested in knowing more about the find command or the globstar Bash option, I may suggest you take a look at that series of three videos I published earlier: