Friday, June 27, 2014

Parallel grepping a large code base with progress display (find|pv|xargs -P)


xargs has a parallel mode, -P, that makes good use of today's multi-core CPUs. When grepping a large code base for a keyword, the search is much quicker if all the cores are put to work. Of course, the overall speed depends not only on parallel grepping but also on the speed of the storage.

On a 4-core computer, you can easily speed up grepping by having find feed 4 parallel grep processes, each handling 10 files at a time, starting from the current folder:

find . -type f | xargs -P 4 -n 10 grep -Hn "what_to_grep"

Make it a shell function and you're good to go:

function ppg {
    find . -type f | xargs -P 4 -n 10 grep --color=always -Hn "$1"
}

$ cd ~/build_trees/android-current/kernel
$ ppg ehci_resume
./drivers/usb/host/ehci-spear.c:53: ehci_resume(hcd, false);
./drivers/usb/host/ehci-pci.c:365: if (ehci_resume(hcd, hibernated) != 0)
./drivers/usb/host/ehci-msm.c:175: ehci_resume(hcd, false);
....
(of course you can alternatively use some other search tools by indexing the code)
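
One caveat with the pipeline above: xargs splits its input on whitespace, so a file name containing a space or a quote is passed to grep in pieces. A minimal sketch of a more tolerant variant follows. The name ppg0 is hypothetical, and it assumes GNU or BSD find and xargs (for -print0 and -0) and falls back to 4 jobs when nproc is not available.

```shell
# Hypothetical variant of ppg that tolerates spaces and quotes in file names.
# Assumes find supports -print0 and xargs supports -0 (GNU and BSD versions do).
ppg0() {
    # nproc is a GNU coreutils tool; fall back to 4 jobs if it is missing.
    njobs=$(nproc 2>/dev/null || echo 4)
    # File names are separated by NUL bytes instead of whitespace.
    find . -type f -print0 | xargs -0 -P "$njobs" -n 10 grep -Hn -- "$1"
}
```

Usage is the same as ppg, for example: ppg0 ehci_resume. Color is left out here only to keep the output plain.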

Now what if you want to see the progress? If you are considering parallel grepping at all, scanning every file probably takes a long time.

A good tool, "pv", is already available. However, for pv to show a meaningful percentage and ETA, you have to give it a hint of how large your data set is. In other words, you have to count how many files there are to grep.

To recursively count files under a folder, a typical and straightforward way is to use find with wc -l, with the first part being identical to what is to be fed into xargs in the first example:

find . -type f | wc -l

For instance, the total file count under the frameworks/ folder of an Android tree is about 30k, and counting it takes about 5s on my notebook:

$ time find ~/trees/android/frameworks -type f | wc -l
30503

real 0m4.950s

Combining everything together, we make it another function:

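function ppgv {
    find . -type f | pv -cN GREP -i 0.5 -l -s $(find . -type f | wc -l) | xargs -P 4 -n 10 grep --color=always -Hn "$1"
}

Here -l makes pv count lines (one per file name) instead of bytes, -s supplies the expected total, -i 0.5 refreshes the display every half second, and -N GREP labels the bar.

When searching for something that doesn't exist in a kernel folder, the version with pv takes a little bit longer: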

$ time ppg doesnt-exist1

real 0m1.168s
user 0m0.140s
sys 0m0.380s

$ time ppgv doesnt-exist2
     GREP: 45.3k 0:00:01 [  39k/s] [================>] 100%            

real 0m1.339s
user 0m0.500s
sys 0m0.524s

The progress bar and ETA come at little extra cost: in this run the pv version took about 0.17s longer in wall-clock time.
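
Part of that extra cost comes from walking the tree twice, once to count and once to feed xargs. A minimal sketch that lists the files only once is shown below. The name ppgv1 and the temporary list file are assumptions, not part of the original function, and the sketch falls back to a plain search when pv is not installed.

```shell
# Hypothetical variant of ppgv: list the files once, then reuse that list
# both for the count given to pv and as the input to xargs.
ppgv1() {
    list=$(mktemp)                          # temporary file holding the file list
    find . -type f > "$list"
    total=$(wc -l < "$list" | tr -d ' ')    # strip padding some wc versions add
    if command -v pv >/dev/null 2>&1; then
        pv -cN GREP -i 0.5 -l -s "$total" "$list"
    else
        cat "$list"                         # no pv installed: search without a progress bar
    fi | xargs -P 4 -n 10 grep -Hn -- "$1"
    rm -f "$list"
}
```

The trade-off is that grep starts only after find has finished listing, in exchange for a single directory walk.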

Tuesday, June 24, 2014

Mixing variables in single/double quotes


Say your task is to find files under a directory and then run another command on the results. This is an easy case: use find -exec:

find <path> -name 'what_to_find' -exec <what_ever_command_to_execute> \; 

Use {} in <what_ever_command_to_execute> wherever you need the path of the just-found file.

However, if what you want to execute is a complicated bash command that deals with the files just found, it is advisable to run it through bash -c and refer to the file as $0 instead of embedding {} in the command text, so that odd characters in the filename cannot be interpreted by the shell. What odd characters? Double quotes, dollar signs, escape sequences, and so on. So now we have this form:

find <path> -name 'what_to_find' -exec bash -c <your complicated commands with $0 > {} \;

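If your command calls bash functions, the plain -exec form from the first example will not work, because find runs the program directly rather than through a shell. Use the bash -c form instead, and export the functions with export -f so that the child bash can see them.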
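
To see the $0 form survive an awkward file name, here is a small self-contained sketch. The scratch directory and the file name are made up for the demonstration, and sh stands in for bash so that it runs on any POSIX system.

```shell
# Create a scratch directory containing a file whose name has a double quote,
# a space, and a dollar sign (all assumed purely for this demo).
demo=$(mktemp -d)
printf 'hello\n' > "$demo/we\"ird \$name.txt"

# The single-quoted script reaches sh untouched. find substitutes the path for {},
# and sh receives it as $0, so the name is never re-parsed as shell code.
result=$(find "$demo" -name '*.txt' -exec sh -c 'printf "found: %s\n" "$0"' {} \;)
printf '%s\n' "$result"
```

The printed path contains the quote and the dollar sign exactly as they appear on disk.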

Then a problem arises. If this find command is embedded in a bash script, chances are you will have variables that define the paths to search. Worse, there may also be variables to pass to the command that find executes. See the example below:

find "$WHERE_TO_FIND" -name "$WHAT_TO_FIND" -exec bash -c <your complicated commands with $0 and $ANOTHER_VAR> {} \;

Now, $WHERE_TO_FIND, $WHAT_TO_FIND, and $ANOTHER_VAR must be expanded before find runs. But $0 is meant for 'bash -c' and must not be expanded before find runs!

The key to the solution is mixing single quotes with double quotes. Just remember one thing: variables are expanded inside double quotes but not inside single quotes. Mixing the two is easy: just put the quoted pieces next to each other with no space in between:

$ echo "this is inside double"' while this is inside single'
this is inside double while this is inside single

Outside quotes, the shell expands variables as well, so you can add unquoted variables too:

$ COMMA=,
$ echo $COMMA
,
$ echo "this is inside double"$COMMA' while this is inside single. $COMMA will not expand.'
this is inside double, while this is inside single. $COMMA will not expand.


The entire string is treated as one word by the shell and passed as the first argument to the echo command.
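
The "one word" claim can be checked by counting arguments. In the sketch below, count_args and SPACED are hypothetical names added for illustration. It also shows the catch: an unquoted variable whose value contains a space is split into several words in bash and other POSIX shells, so double-quoting the variable piece is the safer habit.

```shell
# count_args is a hypothetical helper that prints how many words it received.
count_args() { echo "$#"; }

COMMA=,
SPACED='a b'

n1=$(count_args "this is inside double"$COMMA' while this is inside single.')
n2=$(count_args "x"$SPACED'y')      # unquoted value containing a space gets split
n3=$(count_args "x""$SPACED"'y')    # double-quoting the variable keeps one word
echo "$n1 $n2 $n3"
```

In bash this prints 1 2 1. zsh does not split unquoted variables by default, so the middle number differs there.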

You can also put an unmatched single quote inside double quotes, or vice versa:

$ QUOTES="I'm"
$ echo "$QUOTES"
I'm

$ QUOTES=' a double quote"'
$ echo "$QUOTES"
 a double quote"

Mixing them together looks like this:

$ QUOTES="I'm"' a double quote"'
$ echo "$QUOTES"
I'm a double quote"

Now we have everything we need.
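
Before moving to a real device, the pieces can be tried locally. The following is a minimal sketch with made-up values: the scratch directory, the *.log pattern, and the .bak suffix are all assumptions. sh is used in place of bash for portability.

```shell
# Scratch data for the demo; every name and value here is an assumption.
WHERE_TO_FIND=$(mktemp -d)
WHAT_TO_FIND='*.log'
ANOTHER_VAR=.bak
touch "$WHERE_TO_FIND/a.log" "$WHERE_TO_FIND/b c.log"

# 'cp "$0" "$0"' is single-quoted, so $0 reaches the inner sh unexpanded.
# "$ANOTHER_VAR" is double-quoted, so the outer shell expands it right now.
# The inner script therefore reads: cp "$0" "$0".bak
find "$WHERE_TO_FIND" -name "$WHAT_TO_FIND" -exec sh -c 'cp "$0" "$0"'"$ANOTHER_VAR" {} \;

ls "$WHERE_TO_FIND"
```

If the outer variable could itself contain quotes or spaces, one alternative is to pass it as an extra argument instead of splicing it into the script text: -exec sh -c 'cp "$0" "$0$1"' {} "$ANOTHER_VAR" \;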

Here comes a practical example: use adb shell to copy all *.gcda files under /sys/kernel/debug to a specific folder, using 'cat' instead of tar or cp. This is used to copy GCOV kernel coverage data from an Android device.

This adds another layer of quoting: adb shell 'command to run on android device'. You have to make sure the command as a whole is passed as a single word to adb shell. The final working command is:

adb shell "busybox find $GCDA -name '*.gcda' -exec sh -c '"'cat < $0 > '$TEMPDIR'/$0'"' {} \;"

Let's see how the mixed single and double quotes work, step by step:

  1. We feed $0 (the path of each file that find reports) to the stdin of cat and save the stdout to the same path under $TEMPDIR on the Android device. We don't want $0 expanded now (it would become this script's filename), so the $0 parts are wrapped in single quotes, while $TEMPDIR sits between them unquoted so that it is expanded now:
    'cat < $0 > '$TEMPDIR'/$0'
  2. Now make sure the command passed to 'sh -c' is one word on the device, by surrounding it with a literal pair of single quotes. Each of them sits inside a neighbouring double-quoted section: one at the very end of the first section, the other at the very start of the second:
    ... -exec sh -c '"   and   "' {} \;"
  3. Protect everything else with double quotes, which gives two double-quoted sections:
    "busybox find $GCDA -name '*.gcda' -exec sh -c '"   and   "' {} \;"
  4. Join everything into one word, so that everything after adb shell is passed to the shell on the Android device. Part 1 is the first double-quoted section, part 2 is the single-quoted 'cat < $0 > ', part 3 is the unquoted variable $TEMPDIR, part 4 is the single-quoted '/$0', and part 5 is the second double-quoted section:
    adb shell "part 1"'part 2'$part_3'part 4'"part 5"

This works as expected:
  1. adb shell sends the actual command to run on the Android device,
  2. find in busybox searches for all *.gcda files under the $GCDA folder
  3. and passes each result path to sh,
  4. sh then runs cat, which reads the just-found file (available as $0) and writes its contents to the same path under $TEMPDIR on the Android device.
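
A quick way to check the quoting without a device is to assign the same mixed-quote word to a variable and print it, which shows exactly what adb shell would receive. This is only a sketch: the value of TEMPDIR is hypothetical, and GCDA is set to the /sys/kernel/debug path mentioned earlier.

```shell
# Example values; TEMPDIR in particular is an assumption for this demo.
GCDA=/sys/kernel/debug
TEMPDIR=/data/local/tmp/gcda

# Same five-part word as in the adb command, captured instead of sent.
remote_cmd="busybox find $GCDA -name '*.gcda' -exec sh -c '"'cat < $0 > '$TEMPDIR'/$0'"' {} \;"
printf '%s\n' "$remote_cmd"
```

This prints busybox find /sys/kernel/debug -name '*.gcda' -exec sh -c 'cat < $0 > /data/local/tmp/gcda/$0' {} \; which confirms that $GCDA and $TEMPDIR were expanded locally while both $0 references were left for the device-side sh.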