programming

thrackle.org alive again

March 2, 2010 internet, programming No comments

My thrackle.org website is alive again. It’s about a nice math problem that I worked on 10 – 18 years ago.

webcrawlers desperate for content

January 27, 2010 internet, programming No comments

I recently found this in the web server logs of one of the websites I look after:

38.100.8.50 - - [26/Jan/2010:05:01:44 -0800] "GET /application/json HTTP/1.1" 404 763 "-" "panscient.com"
38.100.8.50 - - [26/Jan/2010:05:01:47 -0800] "GET /following-sibling::* HTTP/1.1" 404 763 "-" "panscient.com"
38.100.8.50 - - [26/Jan/2010:05:01:55 -0800] "GET /AppleWebKit/ HTTP/1.1" 404 763 "-" "panscient.com"
38.100.8.50 - - [26/Jan/2010:05:01:58 -0800] "GET /following-sibling::* HTTP/1.1" 404 763 "-" "panscient.com"

In case you are not familiar with web server log files: these lines mean that someone or something at IP address 38.100.8.50 requested the pages named after “GET” on the website, for example a page named “following-sibling::*”.

Does it need to be said that no such pages exist (that’s what the “404” indicates)?

When I saw this I was rather puzzled, and looked up panscient.com (the last item on each line, which is the user agent the client reported). Their home page says they provide some kind of vertical search service, whatever that is. On their FAQ page, I found this:

Why is your web crawler trying to access pages that don’t exist on my website?

Our web crawler attempts to extract links to valid web pages from javascript and other scripting languages. The crawler may misinterpret the information in these scripts and request a page that does not actually exist. These requests are attempts to retrieve valid web content, and are not an attempt to circumvent your webserver security.

(Emphasis mine) Oh, OK. They are looking into JavaScript files on the web site and attempting to extract names of pages that might have content for the “vertical search”. They were not successful in this case: the requested names look like a MIME type, an XPath expression, and part of a browser’s user-agent string, not pages. As a web developer, I can tell you that JavaScript files very rarely contain interesting links to web pages.

It looks like a pretty competitive business when people start grasping at straws like this. I also take it that bandwidth is easier to come by than crawling software that avoids such silly attempts.

cryptography: a note on cipher block chaining

July 25, 2009 programming No comments

I’ve been looking into encryption methods recently, and came across this little surprise about cipher block chaining, or CBC, as it is used for block ciphers.

Block ciphers only encrypt messages of a fixed length, the block length, which depends on the cipher. To encrypt a longer message one breaks it up into blocks of that length and then encrypts each block individually. The receiver decrypts all the encrypted blocks and pastes the original message back together. So for example, if your message is 2 kilobytes long (one ordinary page of writing), and the block length is 32 bytes, then 2 kilobytes / 32 bytes = 2 * 1024 / 32 = 64 blocks of 32 bytes each will be encrypted. (Padding may or may not be necessary.)
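The block arithmetic can be sketched in shell as follows. The function name block_count is made up for illustration. It only counts blocks, assuming a partial last block is padded up to full length, and it ignores padding schemes that append an extra block.

```shell
#!/usr/bin/env bash
# Number of blocks needed for a message of msg_len bytes,
# rounding up when the last block is only partially filled.
block_count() {
  local msg_len=$1 block_len=$2
  echo $(( (msg_len + block_len - 1) / block_len ))
}

block_count 2048 32   # the 2-kilobyte example with 32-byte blocks: prints 64
block_count 109 16    # a 109-byte message with 16-byte blocks: prints 7
```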

The problem that cipher block chaining addresses is this: if such a long message contains identical blocks, or two messages contain identical blocks, then you can tell that from the encrypted parts, because they will be the same. Anyone who has access to the encrypted message and knows the block cipher employed can extract these blocks. While they cannot decrypt the individual blocks, they can compare them. Such is the world of cryptography that there are cases where it should be made difficult to tell that one message contains parts of a different message, or repeats itself.
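This leak is easy to observe with openssl. The sketch below assumes that encrypting each block individually corresponds to openssl's -aes-256-ecb mode, which handles every 16-byte block independently. The all-zero key and the helper name ecb_blocks are placeholders for illustration only.

```shell
#!/usr/bin/env bash
# Demo key: 64 hex zeros (256 bits). Never use such a key for real data.
key=$(printf '%064d' 0)

# Encrypt stdin block by block (no chaining) and print one cipher block per line.
# -nopad requires the input length to be a multiple of 16 bytes;
# od -v keeps repeated lines instead of collapsing them to "*".
ecb_blocks() {
  openssl enc -aes-256-ecb -K "$key" -nopad | od -An -v -tx1
}

# Two identical 16-byte plaintext blocks followed by a different one.
printf '%s' 'ABCDEFGHIJKLMNOPABCDEFGHIJKLMNOPabcdefghijklmnop' | ecb_blocks
```

The first two output lines should be identical and the third different, which is exactly the information an eavesdropper can read off without knowing the key.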

Cipher Block Chaining

One solution, and the most commonly used “mode of operation” for a block cipher (see 1, 2, 3), is called Cipher Block Chaining. The idea is to introduce an additional block, called the “initial vector” (also known as the initialization vector, or IV). This block is XOR-ed with the first block to be encrypted. The result is encrypted, and yields the first encrypted block to be sent. This encrypted block is in turn XOR-ed with the next block to be encrypted. The result is encrypted, and yields the second encrypted block to be sent, and so on. Let’s generalize, and describe this more accurately:

Suppose our numbering is such that the first block has number 1 (not 0 as is common).

  • Let P(i) be the i-th block of the plain text message.
  • Let E(X) be the result of encrypting the (plain text) block X.
  • Let D(Y) be the result of decrypting the (encrypted) block Y.
  • Let C(i) be the i-th encrypted (cipher) block.

Then encryption with Cipher Block Chaining can be formalized as:

C(0) := IV, the initial vector
C(i) := E( P(i) XOR C(i-1))

If the receiver knows the initial vector as well as the block cipher’s encryption key they can completely decrypt the message. Decryption is formalized like this:

C(0) := IV, the initial vector
P(i) := D( C(i) ) XOR C(i-1)

Decrypting with a Different Initial Vector

Finally I can point out what surprised me: when decrypting, the blocks P(2), P(3), P(4), and so on do not depend on the initial vector IV that was used for encryption! This can be read off the decryption formula: for i ≥ 2, the block C(i-1) that gets XOR-ed in is a received cipher block, and IV enters only through C(0). So only P(1), the first decrypted block, depends on IV, while the other parts of the decrypted message will be the same regardless of IV.

In this way, the contribution of the initial vector is very different from the encryption key! And it is rather nice to see that it need not be any stronger, since it provides the function it is designed for: to hide the information about identical blocks.

And so, if the encrypter prepends some arbitrary initial block to the message, the receiver does not need to know the initial vector used for encryption. After decrypting with some arbitrarily chosen initial vector (all 0’s, for example) they can just throw away the first block; the remaining blocks will be the original message.
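A minimal sketch of this scheme with openssl, assuming AES-256 in CBC mode with its 16-byte blocks. The function names send and receive and the randomly generated key are hypothetical and only serve the illustration.

```shell
#!/usr/bin/env bash
key=$(openssl rand -hex 32)        # hypothetical shared 256-bit key
iv_sender=$(openssl rand -hex 16)  # known only to the sender
iv_zero=$(printf '%032d' 0)        # receiver's arbitrary choice: all zeros

# stdin: plaintext, stdout: ciphertext.
# One throwaway random block is prepended before encrypting.
send() {
  { openssl rand 16; cat; } | openssl enc -aes-256-cbc -K "$key" -iv "$iv_sender"
}

# stdin: ciphertext, stdout: plaintext.
# Decrypt with the "wrong" IV, then drop the first 16 bytes (the garbled block).
receive() {
  openssl enc -d -aes-256-cbc -K "$key" -iv "$iv_zero" | tail -c +17
}

echo "attack at dawn" | send | receive
```

The receiver never sees iv_sender, yet the pipeline prints the original message, because only the discarded first block depends on the initial vector.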

Sample Code with AES and openssl

Here is some rather simple code to illustrate the effect. It is based on one of the Rijndael block ciphers, AES-256 (see Advanced Encryption Standard), and the openssl command-line tool. The openssl options for enc, “symmetric cipher routines”, are available through man enc.

echo "The symmetric cipher commands allow data to be encrypted or decrypted using various block and stream ciphers" > msg.in
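# Encrypt msg.in with some key and an initial vector.
# -K and -iv are read as hex digits; openssl pads values that are too short with zero bytes (newer versions print a warning).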
openssl enc -aes-256-cbc -K 1234567890123456 -iv 1234567890123456 -in msg.in -out msg.crypt
echo Decrypt with both the right key and the right iv
openssl enc -d -aes-256-cbc -K 1234567890123456 -iv 1234567890123456 -in msg.crypt
echo Decrypt with the right key but a different iv
# Pipe into 'od -cx' because there will likely be non-displayable characters. msg.crypt is a proper binary file
openssl enc -d -aes-256-cbc -K 1234567890123456 -iv ABCDEF1234560FED -in msg.crypt | od -cx
echo Compare with the output with the right key and the right iv
openssl enc -d -aes-256-cbc -K 1234567890123456 -iv 1234567890123456 -in msg.crypt | od -cx

When executed in a UNIX shell with all the required programs available, the output is:

Decrypt with both the right key and the right iv
The symmetric cipher commands allow data to be encrypted or decrypted using various block and stream ciphers
Decrypt with the right key but a different iv
0000000 355 221 334   J 327   =   V 326   e   t   r   i   c       c   i
        91ed 4adc 3dd7 d656 7465 6972 2063 6963
0000020   p   h   e   r       c   o   m   m   a   n   d   s       a   l
        6870 7265 6320 6d6f 616d 646e 2073 6c61
0000040   l   o   w       d   a   t   a       t   o       b   e       e
        6f6c 2077 6164 6174 7420 206f 6562 6520
0000060   n   c   r   y   p   t   e   d       o   r       d   e   c   r
        636e 7972 7470 6465 6f20 2072 6564 7263
0000100   y   p   t   e   d       u   s   i   n   g       v   a   r   i
        7079 6574 2064 7375 6e69 2067 6176 6972
0000120   o   u   s       b   l   o   c   k       a   n   d       s   t
        756f 2073 6c62 636f 206b 6e61 2064 7473
0000140   r   e   a   m       c   i   p   h   e   r   s  \n
        6572 6d61 6320 7069 6568 7372 000a
0000155
Compare with the output with the right key and the right iv
0000000   T   h   e       s   y   m   m   e   t   r   i   c       c   i
        6854 2065 7973 6d6d 7465 6972 2063 6963
0000020   p   h   e   r       c   o   m   m   a   n   d   s       a   l
        6870 7265 6320 6d6f 616d 646e 2073 6c61
0000040   l   o   w       d   a   t   a       t   o       b   e       e
        6f6c 2077 6164 6174 7420 206f 6562 6520
0000060   n   c   r   y   p   t   e   d       o   r       d   e   c   r
        636e 7972 7470 6465 6f20 2072 6564 7263
0000100   y   p   t   e   d       u   s   i   n   g       v   a   r   i
        7079 6574 2064 7375 6e69 2067 6176 6972
0000120   o   u   s       b   l   o   c   k       a   n   d       s   t
        756f 2073 6c62 636f 206b 6e61 2064 7473
0000140   r   e   a   m       c   i   p   h   e   r   s  \n
        6572 6d61 6320 7069 6568 7372 000a
0000155

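As you can see, only the first few bytes differ when using the "wrong initial vector". Here it is just the first 8 bytes rather than the whole 16-byte first block: each -iv value above has only 16 hex digits (8 bytes), openssl fills the rest with zeros, and so the two vectors agree in their last 8 bytes.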
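The garbled bytes can even be predicted. Decrypting with a different vector IV' gives D(C(1)) XOR IV', and since D(C(1)) = P(1) XOR IV, the first block comes out as P(1) XOR IV XOR IV'. A small sketch in bash arithmetic, assuming the hex digits of each -iv value map to bytes from left to right (the variable and function names are made up for illustration):

```shell
#!/usr/bin/env bash
# Predict the first 8 bytes of the "wrong IV" output: P(1) XOR IV XOR IV'.
plain="The symm"                     # first 8 bytes of msg.in
iv_right=(12 34 56 78 90 12 34 56)   # -iv 1234567890123456, as hex bytes
iv_wrong=(AB CD EF 12 34 56 0F ED)   # -iv ABCDEF1234560FED, as hex bytes

xor_first_bytes() {
  local i byte out=""
  for i in "${!iv_right[@]}"; do
    # ASCII code of the i-th plaintext character
    byte=$(printf '%d' "'${plain:i:1}")
    # XOR with both IV bytes and format as 3-digit octal, like od -c does
    out+=$(printf '%03o ' $(( byte ^ 0x${iv_right[i]} ^ 0x${iv_wrong[i]} )))
  done
  echo "${out% }"
}

xor_first_bytes   # prints 355 221 334 112 327 075 126 326
```

These are the octal codes in the first od line above: 355 221 334, then 112 (the letter J), 327, 075 (the = sign), 126 (the letter V), and 326.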

Just for future reference, here is my system information when running the above code:

$ uname -a
Linux myosin 2.6.24-19-generic #1 SMP Wed Aug 20 22:56:21 UTC 2008 i686 GNU/Linux
$ bash --version
GNU bash, version 3.2.39(1)-release (i486-pc-linux-gnu)
Copyright (C) 2007 Free Software Foundation, Inc.
$ openssl version
OpenSSL 0.9.8g 19 Oct 2007