Solved Offset of a sequence of null bytes in a file

Emrion · Saturday at 7:56 PM

Hello all,

I'm looking for a means to get the offset of a sequence of null bytes (say 16 for instance) present in a file, and which works with /bin/sh.

I almost got there with grep, but it seems it's not possible to put a zero byte in its regexp. I'm not obliged to use grep. The only requirement is that the commands used must be present in the FreeBSD base system.

An idea, anyone?

covacat · Saturday at 8:18 PM

pipe file thru tr and replace \0 with something else then grep

covacat · Saturday at 8:21 PM

or a shell script reading output of hexdump and count consecutive 0s
this might be slow but will do
you also get the offset
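
A minimal sketch of that counting approach, not covacat's actual script. It uses od(1) rather than hexdump (an assumption made for portability; both are in base) to emit one hex byte per line, and awk to count consecutive 00 values. The function name first_zero_run is made up for illustration.

```shell
#!/bin/sh
# first_zero_run FILE [COUNT]
# Prints the 0-based offset of the first run of COUNT (default 16) NUL bytes.
# Prints nothing and returns non-zero if there is no such run.
first_zero_run() {
    od -An -v -tx1 "$1" |
    tr -s ' ' '\n' |
    awk -v want="${2:-16}" '
        $0 == "" { next }          # skip the empty line left by the leading blank
        {
            n++                    # n = number of bytes read so far
            if ($0 == "00") run++; else run = 0
            if (run == want) { print n - want; found = 1; exit }
        }
        END { exit !found }'
}
```

Called as first_zero_run /boot/gptboot 24, it would print the offset of the first run of 24 zero bytes. As covacat says, this is not fast, since every byte becomes a line of text.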

Kai Burghardt · Saturday at 10:15 PM

For a file shorter than getconf LONG_MAX bytes you can concoct a shell script utilizing cmp(1), something like this:

Code:

#!/bin/sh
# description: Prints the 1‐based offset of the first run of
#              `${zeros_string_length:-16}` zero‐bytes in input (if any).
#      author: Firstname Lastname <mailbox@host>
{
	cat "${1:--}"
	# Sentinel value so `cmp` prints a difference line at the end.
	printf '\n'
} | {
	# Prints lines
	# 
	#   1. 1‐based offset
	#   2. first file’s byte value at offset
	#   3. second file’s byte value at offset
	# 
	# for each differing byte.
	cmp -l - /dev/zero 2>&-
} | {
	while read -r current sample zero dummy
	do
		# Leading `+` is harmless if `${previous}` unset.
		if [ ${current:?} -gt $((${previous} + ${zeros_string_length:-16})) ]
		then
			printf '%d\n' $((${previous} + 1))
			exit
		fi
		previous=${current:?}
	done
}

I hope I’ve covered all edge cases, but maybe double‐check ’em.
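
To make the pipeline above easier to follow, here is a small sketch of what the while loop receives (the helper name nonzero_offsets is hypothetical). cmp -l against /dev/zero reports only the bytes that are not zero, so a jump between two consecutive offsets corresponds to a run of zero bytes.

```shell
#!/bin/sh
# nonzero_offsets: read stdin, print the 1-based offset of every non-zero byte.
nonzero_offsets() {
    # cmp -l prints "offset value1 value2" for each differing byte;
    # only the offset column is kept. The EOF message on stderr is discarded.
    cmp -l - /dev/zero 2>/dev/null | awk '{ print $1 }'
}
```

For example, printf 'A\000\000B\n' | nonzero_offsets prints 1, 4 and 5: the jump from 1 to 4 reveals the two zero bytes at offsets 2 and 3.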

covacat said:
pipe file thru tr and replace \0 with something else then grep

This needs prior knowledge of the file’s contents so you choose a suitable comparison string.

_martin · Saturday at 10:45 PM

I personally use bvi. But you did put constraint in about it being in base.

xxd(1) and grep(1) can help here.

Example:

Code:

$ perl -e 'print "hello" . pack("QQQQQ", 0xcafec0de, 0xdeadbeef, 0,0, 0xf00d1337 )' | hd
00000000  68 65 6c 6c 6f de c0 fe  ca 00 00 00 00 ef be ad  |hello...........|
00000010  de 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00000020  00 00 00 00 00 37 13 0d  f0 00 00 00 00           |.....7.......|
0000002d

Match 16 0x0 bytes in above example:

Code:

$ perl -e 'print "hello" . pack("QQQQQ", 0xcafec0de, 0xdeadbeef, 0,0, 0xf00d1337 )' | xxd -ps -c 0 | grep -bo "00000000000000000000000000000000"
34:00000000000000000000000000000000

That is decimal offset 34 / 2 = 17 (0x11 in the hd output). You do need to divide by 2 because -p prints every byte as two hex characters.
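
The conversion can be sketched in shell arithmetic (the function name is hypothetical):

```shell
#!/bin/sh
# hexpos_to_byte POS
# Converts a character offset in the hex string to a byte offset,
# printed in decimal and in hex. Integer division: two hex characters per byte.
hexpos_to_byte() {
    printf '%d 0x%x\n' $(($1 / 2)) $(($1 / 2))
}
```

hexpos_to_byte 34 prints "17 0x11", which matches the hd output above.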

covacat · Saturday at 11:54 PM

Kai Burghardt said:
This needs prior knowledge of the file’s contents so you choose a suitable comparison string.

you can change A with B then \0 with A and you don't need to worry about this
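
A minimal sketch of that double substitution, assuming K as the stand-in for NUL and x as the replacement for genuine K bytes (both arbitrary choices; the function name is hypothetical). A single tr call can do both mappings, because tr translates each input byte exactly once. The -a option is there so grep treats the remaining binary data as text.

```shell
#!/bin/sh
# nul_run_offset_tr FILE [COUNT]
# Prints the 0-based offset of the first run of COUNT (default 16) NUL bytes.
nul_run_offset_tr() {
    # K -> x first frees up the letter K, then NUL -> K marks the zero bytes.
    LC_ALL=C tr 'K\000' 'xK' < "$1" |
    LC_ALL=C grep -a -b -o -E "K{${2:-16}}" |
    head -n 1 |
    cut -d : -f 1
}
```

Since every K in the translated stream comes from a NUL byte, K bytes already present in the file cannot produce a false match.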

Emrion · Sunday at 9:09 AM

First, thanks to all of you.

covacat said:
pipe file thru tr and replace \0 with something else then grep

I wished, I really wish it worked, but:

The tr utility has historically not permitted the manipulation of NUL
bytes in its input and, additionally, stripped NUL's from its input
stream. This implementation has removed this behavior as a bug.

All my tests show that it ignores any NUL bytes in its processing.

Kai Burghardt said:
For a file shorter than getconf LONG_MAX bytes you can concoct a shell script utilizing cmp(1), something like this:

Your script works, but it's complicated for what I want. Anyway, it's the only code that works so far. Maybe I should try to understand it in order to simplify it for my use case.

_martin said:
I personally use bvi. But you did put constraint in about it being in base.

xxd(1) and grep(1) can help here.

xxd isn't part of the base. So, this is a no-go for me.

_martin · Sunday at 9:28 AM

Emrion said:
xxd isn't part of the base. So, this is a no-go for me.

Ouch, I must have overlooked this when I was checking it (checked od, assumed xxd or something). My bad. hexdump(1) is in base:

Code:

$ perl -e 'print "hello" . pack("QQQQQ", 0xcafec0de, 0xdeadbeef, 0,0, 0xf00d1337 )' | hexdump -v -e '/1 "%02x"' | grep -bo "00000000000000000000000000000000"
34:00000000000000000000000000000000

ralphbsz · Sunday at 9:41 AM

It will be quite hard to find a text-based tool (like sed, grep, tr, awk ...) that knows how to work with nul bytes. The reason is that for all string and text processing on Unix, the nul byte marks the end of a string (or a line).

How about this suggestion: A really simple C program, which copies characters one at a time from stdin to stdout, turning every nul byte into the letter "0", and every non-null byte into the letter "X"? Then compile and link it (that's a single command, cc foo.c -o foo), and run it in a pipeline. Having done that conversion, you can use grep to find a sequence of 16 "0" in a row. If grep has a problem with the fact that the input string is a single "line" with no terminating newline, add a putchar('\n') at the end, before the return.

Code:

#include <stdio.h>
int main(int argc, char* argv[]) {
  int c;  /* int, not char: getchar() returns int so EOF is distinct from every byte value */
  while ((c = getchar()) != EOF) {
    if (c == '\0')
      putchar('0');
    else
      putchar('X');
  }
  return 0;
}

covacat · Sunday at 9:44 AM

tr works
head -c 123 /dev/zero |tr '\0' '\n'|wc -l
123

_martin · Sunday at 9:56 AM

I don't understand why you keep reinventing the wheel here. All the necessary tools are in base. If you need to code something, you could avoid that and go for xxd from ports (which the OP's constraint rules out).
hexdump/grep will do.

Emrion:
Small check to verify on a larger binary (find the offset of "/bin/sh" in libc and verify with gdb that offset/2 is correct):

Code:

$ hexdump -v -e '/1 "%02x"' /lib/libc.so.7 | grep -bo $(echo -n "/bin/sh" | hexdump -v -e '/1 "%02x"') | awk -F: '{print "gdb -q -ex \"x/s", $1/2 "\" -ex \"q\" /lib/libc.so.7"}'|sh
Reading symbols from /lib/libc.so.7...
(No debugging symbols found in /lib/libc.so.7)
0x45370:    "/bin/sh"

Update: the gdb trick here could be a little confusing. It works for this situation but could confuse people if they try to match something else (due to gdb).

The hexdump/grep combination does what you need, though.

Side note: for actual string operations you could help yourself with just strings (I know, not what you're after in this thread):

Code:

$ strings -a -t x /lib/libc.so.7 |grep "/bin/sh"
  45370 /bin/sh

Emrion · Sunday at 10:23 AM

covacat said:
tr works
head -c 123 /dev/zero |tr '\0' '\n'|wc -l
123

You're right. I forgot the quotes around \0.

But now I'm facing another problem. The tr substitution method and Kai Burghardt's script don't give the same value on a file. Moreover, both seem wrong. I need to take time to clear that up.

Edit: for the tr method, it's due to the behavior of grep, which treats the file line by line. It always points to the byte just after a newline character. The run of zeros is somewhere in that line, possibly a hundred bytes (or more) further on, before the next newline.
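
The effect described in the edit can be reproduced on a tiny input. In this sketch (function names hypothetical), K stands for a translated NUL byte: without -o, grep -b reports where the matching line starts; with -o, it reports where the match itself starts.

```shell
#!/bin/sh
# Sample data: "ab", newline, "cd", three K characters, "e", newline.
sample() { printf 'ab\ncdKKKe\n'; }

# Offset of the line that contains the match: 3, just after the first newline.
line_offset() { sample | grep -b -E 'K{3}' | cut -d : -f 1; }

# Offset of the match itself: 5.
match_offset() { sample | grep -b -o -E 'K{3}' | cut -d : -f 1; }
```

The difference between the two results is the distance from the start of the line to the run, which in a binary file can be arbitrarily large.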

Emrion · Sunday at 4:53 PM

OK, let me explain what this is all about. It concerns some new developments in loaders-update.

Among other things, this script tests the content of a freebsd-boot partition to find out whether it contains gptboot or gptzfsboot. I wasn't, and still am not, satisfied with the way it operates. The currently distributed version just tests whether it finds the strings 'ZFS' and 'zfs' (with grep). If not, it's gptboot; a questionable logic.

I added the detection of a string that belongs to the BTX loader, so it can tell whether the partition has never been written with a FreeBSD BIOS loader at all. Fine, but after several tests I found a bug, a very special case...

What if a partition was first written with gptzfsboot and later with gptboot? As gptboot is much smaller than gptzfsboot, the strings 'ZFS' and 'zfs' remain in the partition and it is falsely detected as gptzfsboot.

Hence the idea of detecting the end of the file: an area with many 0 bytes. I compare it with the offset where 'ZFS' was found. If the end-of-file offset is lower, we are in this special case and it is in fact gptboot.

For the moment, I'm sticking with covacat's solution because it's the simplest and the quickest:
eof=$(cat "$p" | tr '\0' K | grep -b -m1 -E 'K{24}' | cut -f1 -d ':')
$p is the partition to examine.

I turned Kai Burghardt's script into a simplified function that also works, but it's slower (and I don't understand why it also doesn't give the right offset):

Code:

# DetectEOF FileToExamine NumberOfZeroWanted
DetectEOF()
{
    local current previous
    unset previous

    cat "$1" | cmp -l - /dev/zero 2>&- | {
        while read -r current sample zero
        do
            if [ ${current} -gt $((${previous} + $2)) ]; then
                printf '%d\n' $((${previous} + 1))
                return
            fi
            previous=${current}
        done
    }
}
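
One possible explanation for the differing offsets, offered as an assumption rather than a diagnosis: cmp -l numbers bytes from 1, while grep -b counts from 0, so the two methods differ by one for the same run. The sketch below (hypothetical function name zero_run_cmp) reports the 0-based offset instead.

```shell
#!/bin/sh
# zero_run_cmp FILE [COUNT]
# Prints the 0-based offset of the first run of at least COUNT (default 16)
# NUL bytes that is followed by a non-zero byte.
zero_run_cmp() {
    cmp -l "$1" /dev/zero 2>/dev/null |
    {
        previous=0                     # 1-based offset of the last non-zero byte
        while read -r current sample zero
        do
            if [ "$current" -gt $((previous + ${2:-16})) ]; then
                # previous + 1 is the 1-based start of the run,
                # so previous itself is the 0-based start.
                echo "$previous"
                break
            fi
            previous=$current
        done
    }
}
```

Like the function above, this only sees a run that is followed by a non-zero byte. Kai Burghardt's original appends a sentinel newline to the input for that reason.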

ralphbsz said:
How about this suggestion: A really simple C program, which copies characters one at a time from stdin to stdout, turning every nul byte into the letter "0", and every non-null byte into the letter "X"? Then compile and link it (that's a single command, cc foo.c -o foo), and run it in a pipeline. Having done that conversion, you can use grep to find a sequence of 16 "0" in a row. If grep has a problem with the fact that the input string is a single "line" with no terminating newline, add a putchar('\n') at the end, before the return.

That would be the best solution, because I realize that grep isn't good for binary files. This is true not only for a sequence of zeros but for any string detection. The problem is that it means compiling part of the port, and I don't want to go that way for the moment.

Maybe a simpler alternative would be to add a dependency on a tool that does the job well and is designed for it: retrieve the offset of a string in a binary file. Does anyone know a suitable port?
I admit, I don't like depending on other software.

_martin · Sunday at 5:32 PM

Emrion said:
retrieve the offset of a string in a binary file.

Because hexdump and grep is not working for you or .. ?

Emrion · Sunday at 5:47 PM

_martin said:
Because hexdump and grep is not working for you or .. ?

grep is the problem. Its result depends on the presence of 0x0a bytes in the file.

_martin · Sunday at 5:50 PM

Can you give me your failing example?
Because when I do this:

Code:

$ perl -e 'print "hello" . pack("QQQQQ", 0xcaf0a0a0a, 0xdeadbeef, 0,0, 0xf00d1337 )' |hd
00000000  68 65 6c 6c 6f 0a 0a 0a  af 0c 00 00 00 ef be ad  |hello...........|
00000010  de 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00000020  00 00 00 00 00 37 13 0d  f0 00 00 00 00           |.....7.......|
0000002d

I can still match the pattern just fine:

Code:

$ perl -e 'print "hello" . pack("QQQQQ", 0xcaf0a0a0a, 0xdeadbeef, 0,0, 0xf00d1337 )' | hexdump -v -e '/1 "%02x"' | grep -bo "00000000000000000000000000000000"
34:00000000000000000000000000000000

Still matches the offset as mentioned - 34/2 = 17 or 0x11.

Emrion · Sunday at 6:57 PM

Good! Seems to work, but hd is useless. It's the -o option of grep that changes everything. I have to add a filter like head -n1 to keep only the first number (as the -m option has no effect with -o). No division needed without hd:

cat "$p" | tr '\0' K | grep -bo -E 'K{24}' | head -n1 | cut -f1 -d ':'

_martin · Sunday at 7:14 PM

There are more ways to skin a cat; you can use whichever solution you see fit here.

Emrion said:
but hd is useless.

I beg to differ; it's very powerful. Personally I don't like the tr solution at all; it's dirty and workaround-ish. I'm not sure your example is actually correct, as there could be K bytes in the stream; I don't see you applying covacat's advice there.

hexdump gives you a string of hex data that grep can search, printing the offset. It doesn't get easier than that. Additional work is needed to divide the result by 2, but it's still cheaper (system-resource-wise) than doing two forks with head and cut.

Anyway, what you use is up to you. I wanted to point out that you do have all the tools in base to solve your (and potential future googlers') problem.

ralphbsz · Sunday at 11:12 PM

Philosophical question: What scripting languages are in base? Python isn't. I think perl isn't either. Clearly the modern/fancy/bizarre ones (like Ruby, Rexx, JS) aren't. Is it really just awk and the shells?

By scripting language I mean: Something in which you can write sensible programs, with if statements, loops, file IO, binary IO, string processing, and some printf-like functionality. Awk barely qualifies, as it is so much built around the paradigm of lines of text and filtering with a series of patterns.

gpw928 · Monday at 12:23 AM

With LC_CTYPE=C

Code:

# Translate all non-null characters into the letter "Z".
# Translate all null characters into the letter "A".
# Emit the string preceding the first run of 16 "A" characters.
# Count the characters in the string emitted.
tr '\001'-'\377' Z \
    | tr '\000' A \
    | sed -e 's/AAAAAAAAAAAAAAAA.*//' \
    | wc -c

Emrion · Monday at 9:12 PM

There is a problem with tr in versions 14.1-RELEASE and 13.3-RELEASE (and even in an old 15-CURRENT). Trying to pipe a binary file through tr results in:

tr: Illegal byte sequence

To reproduce this (in the mentioned RELEASEs): cat /boot/gptboot | tr a b

I suppose that a fix has been applied to tr (or a library it uses) since 14.2.

Kai Burghardt's function is very time-consuming: one or two seconds per partition.

So, I'm going to use hd as already mentioned by _martin. I don't like dividing by two because the offset reported by grep is sometimes odd (a trailing zero digit of the byte before the zeros, I suppose) but, after testing, this doesn't matter for the purpose of this program.

Code:

eof=$(cat "$p" | hexdump -v -e '/1 "%02x"' | grep -bo "000000000000000000000000000000000000000000000000" | head -n1 | cut -f1 -d ':')
eof=$((eof/2))

It's not really elegant, but I can improve on it. Anyway, it works and it works quickly.
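
The odd offsets come from matches that start in the middle of a byte, for example on the low digit of 0x10. A minimal sketch of one way around this, assuming a separator is kept between bytes so that a match can only start on a byte boundary. It uses od(1) instead of hexdump (a substitution made for portability; both are in base) and a hypothetical function name.

```shell
#!/bin/sh
# zero_run_offset FILE [COUNT]
# Prints the 0-based offset of the first run of COUNT (default 16) NUL bytes.
zero_run_offset() {
    # Turn the dump into one line of "xx " groups separated by single spaces.
    od -An -v -tx1 "$1" |
    tr '\n' ' ' |
    tr -s ' ' |
    grep -b -o -E "(00 ){${2:-16}}" |
    head -n 1 |
    {
        # Each byte occupies 3 characters; integer division also absorbs
        # the single leading space that od puts in front of the first byte.
        IFS=: read -r off rest && echo $((off / 3))
    }
}
```

With the bytes 41 10 00 00 00 00 42 and a count of 4, the plain hex string matches at character 3, which halves to byte 1, whereas this version reports byte 2, where the zeros really start.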

gpw928 · Monday at 9:37 PM

Emrion said:
There is a problem with tr

See LC_CTYPE=C, in #20 above.

Code:

[gunsynd.153] $ ls -lad /boot/gptboot
-r--r--r--  1 root  wheel  61922 Jan 18 09:53 /boot/gptboot
[gunsynd.154] $ time (
    export LC_CTYPE=C
    cat /boot/gptboot \
        | tr '\001'-'\377' Z \
        | tr '\000' A \
        | sed -e 's/AAAAAAAAAAAAAAAA.*//' \
        | wc -c
)
   61616
    0.02s real     0.03s user     0.01s system
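
A small sketch of the same translation with the locale pinned per command, assuming the "Illegal byte sequence" error comes from tr decoding its input as multibyte characters under the active locale. LC_ALL=C is used here instead of LC_CTYPE=C because LC_ALL overrides the other locale variables; the function name is hypothetical.

```shell
#!/bin/sh
# mark_bytes: read stdin, write "A" for each NUL byte and "Z" for any other byte.
mark_bytes() {
    LC_ALL=C tr '\001-\377' 'Z' | LC_ALL=C tr '\000' 'A'
}
```

Setting the variable on each command keeps the change local to the pipeline, so the rest of a script keeps whatever locale it was started with.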

Emrion · Tuesday at 5:34 PM

gpw928 said:

See LC_CTYPE=C, in #20 above.

OK, the problem wasn't the FreeBSD versions but an empty LC_CTYPE on those VMs. Thank you.
Your command does indeed work. But I now have a simpler command that runs fast enough.

Code:

time cat /boot/gptboot | hexdump -v -e '/1 "%02x"' | grep -bo -E '0{48}' | head -n1 | cut -f1 -d ':'
        0,00 real         0,00 user         0,00 sys
123767

OK, I still have to divide the result by two, but that should be fine. That's enough for me. Thanks for your method, as it could be of some use if something goes wrong in my tests or in a reported issue.