Build Your Own Bitcasa

Earlier this week I heard about a new startup called Bitcasa which is offering “infinite” secure cloud storage for a low monthly fee. Now, I’m not particularly interested in relying on a brand-new startup for all of my off-site storage needs, but one of Bitcasa’s technical claims seems to have raised a few eyebrows on the Internet. In particular, I learned from an episode of the podcast Security Now that Bitcasa claims to use exclusively client-side encryption and also to be able to de-duplicate files server-side.

Think about that for a moment, and it may at first seem impossible. How can you de-duplicate plaintext that the server never has access to? But it’s not impossible, and I wouldn’t doubt that many ways of doing this are widely known, but I did find it to be a really interesting computer science brainteaser. Here’s the problem, in my words:

Design a service that allows users to store blocks of data and retrieve them later. The service must have the following properties:

  1. No block is stored more than once.
  2. No more than O(1) additional space is used per user who “owns” a particular block.
  3. No user is able to decrypt a block that he does not own.
  4. The service could not be compelled by any authority to decrypt the blocks it stores.

Extra Credit:

  1. No communication between users, directly or through the service, is required.
  2. The service could not be compelled by any authority to divulge which users own which blocks.

In the Security Now episode in question, the host gave a solution which satisfied 1-4, but not 5 or 6, so I have labelled them as extra credit. If you want to try solving this problem yourself, stop reading now — my solution follows.

First, some notation and definitions. Below I will be making use of two well-known algorithms: AES, a symmetric key cipher, and SHA, a cryptographic hash. Both of these come in various flavors, but the only requirement is that the output of the SHA flavor used must be the same length as the key for the AES flavor used. So, for the sake of argument, let’s say we use AES-256 and SHA-256. I will denote an application of SHA to some data X as SHA(X). I will denote an application of AES to some data X with key K as AES(X, K).

And, before beginning, I should point out that I have no idea if this is even close to what Bitcasa actually does. I don’t even know if this method is well-known and is named after some professor somewhere. This is just the way I came up with to solve the problem as presented after I thought about it for a while.

Storing a New Block

Consider User 1. This user generates a private key PK1 which is kept secret and never sent to the service. When the user logs into the service for the first time, he creates a block list BL1, initially empty. This list will store information about the blocks he owns (details later), and its size will be O(n) in the number of owned blocks. This block list file is encrypted to CL1 = AES(BL1, PK1) and the encrypted list CL1 is uploaded to the service. This file is directly associated with the user’s account and is not de-duplicated. Since the file is linear in the number of owned blocks, the user is contributing only O(1) additional data per owned block, satisfying requirement 2, even though the block list itself is not de-duplicated. The service is unable to decrypt the block list because it is encrypted using PK1, which is never sent to the server.

Now, User 1 has a block B which he wishes to store. He computes SHA(B), the hash of B, and then uses the hash as the key to encrypt the block: C = AES(B, SHA(B)). In other words, the encrypted block C is the original block B encrypted using its own hash as the key. The user then sends C to the server, and the server uses SHA(C) as the unique identifier for the block B. Note that the server cannot decrypt C since the key SHA(B) was never sent to the server. This satisfies requirement 4.

Ownership and De-Duplication

After the block is stored, User 1 must claim ownership of the block. He does this by adding a record to the block list consisting of the tuple: <filename, block number, SHA(B), SHA(C)>. The first two elements are for the user’s benefit only; they don’t have anything to do with how the service works. The filename and block number would allow the client to reassemble the blocks into complete files with names so that they can be presented in a user-friendly way. But what’s important is SHA(B) and SHA(C). Recall that SHA(B) is the key used to encrypt B and SHA(C) is the unique identifier that the service uses for identifying encrypted block C. So what this record says is “I know that using SHA(B), I can decrypt the block identified by SHA(C)“. This is an association that is known only by owners of the file, satisfying requirement 3. The user then saves the new block list file, re-encrypts it using PK1, and re-uploads it to the server. Again, the server is unable to read the contents of this file.

If User 1 later wants to retrieve a block, he requests CL1, the encrypted block list, from the server. He decrypts CL1 into BL1 using PK1, and then looks up the filename and block number he wants to retrieve. He then requests the block with identifier SHA(C) from the server, and the server responds with C. Finally, the user decrypts C into B using SHA(B) and the original plaintext data is available.

Now suppose User 2 with private key PK2 comes along with the same block B and wants to store. He can compute the same encrypted block C = AES(B, SHA(B)) using only his local knowledge. When he tries to send C to the server, the server will notice that it already has a block with identifier SHA(C), and will not store C again, satisfying requirement 1. User 2 can then add <filename, n, SHA(B), SHA(C)> to his block list BL2, encrypt it to CL2 = AES(BL2, PK2), and upload CL2. User 2 can retrieve the original block B in the same way that User 1 did, except by retrieving CL2 instead of CL1. Note that User 1 and User 2 now share an encrypted block C on the server that they can both read, but they never had to communicate with each other, satisfying extra credit requirement 5.

Ownership Anonymity

Finally, consider requirement 6. There’s a potential weakness to this sort of system. Say that block B contained some sort of illicit materials (e.g. pirated media). You could imagine an authority taking a copy of B, computing SHA(C) and then subpoenaing the service for a list of users who own that block. Could the service comply with such a subpoena? Assuming the service is trying to protect the privacy of its users and not doing anything stupid like logging which IPs addresses sent or received which blocks, it would be technically impossible for the service to associate the block identified by SHA(C) with either User 1 or User 2. The only place in which that association exists is within CL1 and CL2, which are encrypted with PK1 and PK2 respectively, and the service does not possess these keys. Therefore, the service is not only secure, but anonymous with respect to the association between users and their data, satisfying extra credit requirement 6.

One final note: Observe that it is not a security flaw if an encrypted block C or an encrypted block list CL were to fall into the hands of someone other than their owner. These are both strongly encrypted, and nothing meaningful can be extracted from either without the appropriate keys, which are posessesed (or attainable) only by the true owners. So the service can happily respond to a request for SHA(C) with C without having to know or care whether or not the requestor is one of the owners of C. The only further security consideration woud be to ensure that a user be authenticated when uploading his encrypted block list, otherwise the service is vulnerable to a denial of service attack by means of a malicious user overwriting another user’s block list with garbage.

Share this content on:



    1. Ah, thanks for the pointer! I did some googling to see if I could find any known flaws with using a hash as an encryption key, but couldn’t get any relevant results. Looks like I was missing the proper keywords “convergent encryption”.

  1. Very interesting read!

    Now Jack, if I’m not mistaken the problems of convergent encryption, which I also learned about through your link, thank you, apply if the encrypted block as well as the hash of this block are known to the attacker. They can try “guessing” the plaintext by hashing their guesses if big parts of the plaintext are known (“boilerplates”).

    If I understand you two correctly, though, this weakness doesn’t apply in the setting Tyler describes. The encrypted block C is encrypted using SHA(B) and neither SHA(B) nor B are ever known to the server or anyone else. The Hash SHA(C) is used to uniquely reference a certain block of ciphertext for deduplication yet it is *useless* in trying to decrypt C into B. So if SHA(B) and C were known, convergent encryption might be a problem. As it is, SHA(B) is buried in the user’s encrypted block list and cannot be retrieved by the service.

    Am I wrong?

    Cheers, Philip

  2. It’s nice to see that I came up with something similar.
    The weakness only sort of applies. If, say, your bank sends a letter to all their customers with your name and a secret, an attacker could try frequent names (or names of known, wealthy customers like the bank’s manager) and brute-force the secret. Countermeasures by limiting the blocks that can be requested in a certain time is not desirable. It is possible to drop requirement 6 and only allow requesting blocks which someone really owns, where an attacker probably has to upload a larger amount of data (yet, considering you only want to guess 4 characters from a 1 MB file, that’s practically nothing). Brute-forcing larger amounts of data could probably be prevented by limiting the number of requests, but the quotas are probably not much lower than that what a service would use to prevent DoS-attacks…
    Anyway, such a system would be a huge improvement compared to what for example dropbox does. And I like how the system does not prevent a service provider from protecting itself against malicious users.

  3. My understanding of bitcasa is as follows:

    1. File(F) -> Hash(SHA) -> SHA(F)
    2. Encrypt F symmetrically with SHA(F) as key -> AES(F,SHA(B))
    3. Encrypt SHA(F) symmetrically with UserPassword PW -> AES(SHA(B),PW)
    4. Upload both(!) to Bitcasa
    5. Bitcasa hashes the encrypted file for deduplicating and indexing ->
    SHA( AES( F, SHA(B) ) )
    Now the user can download both and decrypt the key and then the file.

    My concern is that the UserPassword must be kept on the client and may not be transferred, even with a login-mechanism!

    My other concern is: “Note that the server cannot decrypt C since the key SHA(B) was never sent to the server. This satisfies requirement 4.” This is wrong, unfortunately. You simply have to encrypt it with another (or more) key(s) (backdoor) without notice.

    Therefore, the software must be open source to prevent this kind of silent attack.

Leave a Reply

Your email address will not be published. Required fields are marked *