Manually Unpacking Bitmessage Databases

Bitmessage is an encrypted, peer-to-peer messaging protocol. It provides strong privacy guarantees by encrypting each message with the recipient’s public key and then distributing it through the network without a recipient address. Every node receives every message and tries to decrypt it with its own private keys. If a message cannot be decrypted then it wasn’t intended for your node, but no one else can tell, and that is how receiver anonymity is achieved. The protocol also employs proof of work to prevent spam, so with all this overhead and latency Bitmessage is more analogous to email than to instant messaging.

The reference implementation of Bitmessage is called PyBitmessage. It stores its data in an SQLite file called messages.dat, which may be found in a number of places including $BITMESSAGE_HOME, $XDG_CONFIG_HOME/PyBitmessage/ or $HOME/PyBitmessage/ (PyBitmessage searches these paths in this order at startup).
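
As a rough sketch of that search order, the function below checks the three locations named above and prints the first messages.dat it finds. The name find_messages_dat is my own invention, and it only mirrors the order described here; it is not taken from the PyBitmessage source.

```shell
# Print the path of the first messages.dat found, checking the locations
# in the order listed above. Returns non-zero if none of them has the file.
# find_messages_dat is a hypothetical helper, not part of PyBitmessage.
find_messages_dat() {
  for dir in "${BITMESSAGE_HOME:-}" \
             "${XDG_CONFIG_HOME:+$XDG_CONFIG_HOME/PyBitmessage}" \
             "${HOME:+$HOME/PyBitmessage}"
  do
    # unset variables expand to an empty string and are skipped
    if [ -n "$dir" ] && [ -f "$dir/messages.dat" ]; then
      printf '%s\n' "${dir%/}/messages.dat"
      return 0
    fi
  done
  return 1
}
```

It could then be used as db=$(find_messages_dat) && sqlite3 "$db" '.schema'.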

† sqlite3 messages.dat '.schema'
CREATE TABLE inbox (msgid blob, toaddress text, fromaddress text, subject text, received text, message text, folder text, encodingtype int, read bool, sighash blob, UNIQUE(msgid) ON CONFLICT REPLACE);
CREATE TABLE sent (msgid blob, toaddress text, toripe blob, fromaddress text, subject text, message text, ackdata blob, senttime integer, lastactiontime integer, sleeptill integer, status text, retrynumber integer, folder text, encodingtype int, ttl int);
CREATE TABLE subscriptions (label text, address text, enabled bool);
CREATE TABLE addressbook (label text, address text);
CREATE TABLE blacklist (label text, address text, enabled bool);
CREATE TABLE whitelist (label text, address text, enabled bool);
CREATE TABLE pubkeys (address text, addressversion int, transmitdata blob, time int, usedpersonally text, UNIQUE(address) ON CONFLICT REPLACE);
CREATE TABLE inventory (hash blob, objecttype int, streamnumber int, payload blob, expirestime integer, tag blob, UNIQUE(hash) ON CONFLICT REPLACE);
CREATE TABLE settings (key blob, value blob, UNIQUE(key) ON CONFLICT REPLACE);
CREATE TABLE objectprocessorqueue (objecttype int, data blob, UNIQUE(objecttype, data) ON CONFLICT REPLACE);

Yes, many of these fields are binary blobs, including the msgid used as the unique key. This seems like a stupid design decision but we can deal with it. SQLite includes a function called hex() that turns these fields into printable ASCII.

† sqlite3 messages.dat 'select hex(msgid) from inbox;' | head
0002AD7D1668A45E3893EB57F5D8B445CEA8459B1BDEF301ED8DEFC08C8C6238
00058181A31203D26B3D78C64FC8D7721BD725DDF317DBB59E7E73087247489D
0007246AC18015636F9E1CB60FDFB4A9183CC25991B8B451D8C0AD66D9E25C53
000DE9ADD8F455B361EBE638B5D062D3A7C70D9A9708BE4B5470CB4B0A8E2F9F
0013161778A84120B5AF13269FC464FD61234DBCFC60FE62071F3C2CC3BDA8C7
00134E521DA2D57AFDF32D4ABF19FEAD28685D75C574056CCC7D464A0238C260
00157E9ED99650D429B90C31BCA837983BE2378E8F647278A143592644DF330B
001B082AC8C95E8A335C218536E9A267C06B666886D536ADA19504D4B6D37477
002B24EEFBEF4DD49CF1F29F4DE6D6D89A82227A8C762B4ADCAB61B0E4AD9510
002C33FB7D5555DA7A8A338F2A2900BAF8D2F5FECA69189E0F369E8CB10E6FB1

More importantly, we can invert this process to do lookups on the database by adding where hex(msgid) = to our queries.

† sqlite3 messages.dat "select subject from inbox where hex(msgid)='0002AD7D1668A45E3893EB57F5D8B445CEA8459B1BDEF301ED8DEFC08C8C6238';"
Re: Be advised that Protonmail's web client is only open source. Read the link below with the admonition, http://thehackernews.com/2017/07/dream-market-darkweb.html

To extract the data, I want the message contents in one file, the metadata in another, and any embedded files extracted too.

Metadata can be extracted into shell variables like this:

row=$(sqlite3 messages.dat 'select toaddress, fromaddress, received, folder, encodingtype, read, hex(sighash) from inbox where hex(msgid)='"'$msgid'"';')

toaddress=$(echo "$row" | awk -F'|' '{print $1}')
fromaddress=$(echo "$row" | awk -F'|' '{print $2}')
received=$(echo "$row" | awk -F'|' '{print $3}')
folder=$(echo "$row" | awk -F'|' '{print $4}')
encodingtype=$(echo "$row" | awk -F'|' '{print $5}')
wasread=$(echo "$row" | awk -F'|' '{print $6}')
sighash=$(echo "$row" | awk -F'|' '{print $7}')
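
To get the metadata into a file of its own, one option is to write the fields out as "key: value" lines. This is only a sketch: the write_metadata name, the line format and the .meta suffix are my assumptions, and it splits the row with read instead of awk.

```shell
# Split one pipe-separated inbox row and write it as "key: value" lines.
# write_metadata and the output format are hypothetical choices for this sketch.
write_metadata() {
  row=$1
  out=$2
  printf '%s\n' "$row" | {
    # '|' is the default column separator of the sqlite3 shell
    IFS='|' read -r toaddress fromaddress received folder encodingtype wasread sighash
    printf 'toaddress: %s\n'    "$toaddress"
    printf 'fromaddress: %s\n'  "$fromaddress"
    printf 'received: %s\n'     "$received"
    printf 'folder: %s\n'       "$folder"
    printf 'encodingtype: %s\n' "$encodingtype"
    printf 'read: %s\n'         "$wasread"
    printf 'sighash: %s\n'      "$sighash"
  } > "$out"
}
```

With the row from the query above this would be called as write_metadata "$row" "${msgid}.meta".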

The subject and contents can be extracted like this:

mkdir -p output
sqlite3 messages.dat 'select subject from inbox where hex(msgid)='"'$msgid'"';' > "output/${msgid}.message"
sqlite3 messages.dat 'select message from inbox where hex(msgid)='"'$msgid'"';' >> "output/${msgid}.message"

I decided to use the msgid for the filenames instead of the subject because the latter could contain anything and escaping all the potential badness does not sound like fun.
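
To run this over the whole inbox, the list of hex msgids from earlier can drive a loop. The following is a minimal sketch; the dump_inbox name and its two arguments (database path and output directory) are assumptions of mine.

```shell
# Dump every inbox message to <outdir>/<msgid>.message, subject line first.
# dump_inbox and its arguments are hypothetical; adjust to taste.
dump_inbox() {
  db=$1
  outdir=$2
  mkdir -p "$outdir"
  sqlite3 "$db" 'select hex(msgid) from inbox;' |
  while read -r msgid; do
    # hex() only ever prints 0-9 and A-F, so the value is safe to splice
    # into the SQL string and to use as a filename
    sqlite3 "$db" "select subject from inbox where hex(msgid)='$msgid';" \
      < /dev/null > "$outdir/$msgid.message"
    sqlite3 "$db" "select message from inbox where hex(msgid)='$msgid';" \
      < /dev/null >> "$outdir/$msgid.message"
  done
}
```

For example: dump_inbox messages.dat output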

Even though messages are text only, the PyBitmessage client supports displaying files embedded using HTML. Note that Bitmessage does seem successful in providing some level of sender anonymity, and the nature of the design makes it impossible for the network to censor messages based on content, so you may want to be careful with extracting any images.

Since we’ve already extracted all messages into text files, we can leverage that to pick out the ones with embedded files without going through the database:

grep -l '<img' output/*.message > files

while read file;
do
...
done < files

Some users send HTML messages through their Bitmessage daemon using mail clients, and some mail clients wrap messages into 80-column lines. An easy way to undo this formatting is to remove all whitespace, newlines included, from the message.

tr -d '[:space:]' < "$file" > flattened
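
Deleting whitespace is harmless for the embedded data because base64 itself never contains any. Here is a small self-contained check with a made-up payload, using tr so that the newlines are deleted along with spaces and tabs:

```shell
# A base64 payload wrapped at a fixed width, as a mail client might leave it.
wrapped=$(printf 'hello bitmessage, this is embedded data' | base64 | fold -w 20)

# Deleting every whitespace character (newlines included) restores one line.
flat=$(printf '%s\n' "$wrapped" | tr -d '[:space:]')

# The flattened payload decodes back to the original bytes.
printf '%s' "$flat" | base64 -d
```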

Each message may have multiple embedded files, so we can extract them all like this:

grep -oP 'data:image/[^;]+;base64,[a-zA-Z0-9+/=]+' flattened > imgs

It is then just a matter of extracting the base64-encoded data from the HTML tags and decoding it.

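# recover the msgid from a path such as output/<msgid>.message
msgid=$(basename "$file" .message)
i=0
while read -r j
do
  # the image subtype (png, jpeg, ...) doubles as the file extension
  ext=$(echo "$j" | grep -oP 'image/[^;]+' | cut -d/ -f2)
  filename="$msgid-$i.$ext"
  # everything after the first comma is the base64 payload
  echo "$j" | cut -d, -f2 | base64 -d > "output/$filename"
  echo "output/$filename"
  i=$((i + 1))
done < imgs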
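
The flatten, match and decode steps can also be chained into one pipeline without the intermediate files. This is a sketch under my own naming (extract_images, with the message file and output directory as arguments); it additionally strips unexpected characters from the extension, since that string comes straight from the message.

```shell
# Extract every base64-embedded image from one message file into outdir.
# extract_images and its arguments are hypothetical names for this sketch.
extract_images() {
  file=$1
  outdir=$2
  msgid=$(basename "$file" .message)
  i=0
  tr -d '[:space:]' < "$file" |
  grep -oE 'data:image/[^;]+;base64,[A-Za-z0-9+/=]+' |
  while read -r uri; do
    # image subtype between "data:image/" and the first ';'
    ext=${uri#data:image/}
    ext=${ext%%;*}
    # the subtype is untrusted input: keep only harmless characters
    ext=$(printf '%s' "$ext" | tr -cd 'A-Za-z0-9+')
    # everything after the first comma is the base64 payload
    printf '%s' "${uri#*,}" | base64 -d > "$outdir/$msgid-$i.$ext"
    printf '%s\n' "$outdir/$msgid-$i.$ext"
    i=$((i + 1))
  done
}
```

For example: while read -r file; do extract_images "$file" output; done < files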

There is nothing extra to know about unpacking the sent table, except that it has some additional fields such as ackdata and senttime.
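
For illustration, a metadata query against sent might look like the sketch below. The sent_metadata name and the choice of columns are mine. Note that the schema above declares no UNIQUE constraint on sent.msgid, so a query like this could return more than one row.

```shell
# Print metadata rows from the sent table for one msgid (upper-case hex).
# sent_metadata is a hypothetical helper; the column selection is arbitrary.
sent_metadata() {
  db=$1
  msgid=$2
  # refuse anything that is not plain hex before splicing it into SQL
  case $msgid in
    ''|*[!0-9A-F]*) echo "not an upper-case hex msgid: $msgid" >&2; return 1 ;;
  esac
  sqlite3 "$db" "select toaddress, fromaddress, senttime, lastactiontime, status, hex(ackdata) from sent where hex(msgid)='$msgid';"
}
```

For example: sent_metadata messages.dat "$msgid"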