There are quite a few papers on watermarking text, and most of them are pretty complex. I was trying to think of a simpler, if less robust, solution that could help track text being cross-posted on websites and blogs. The idea is that you give each user the same block of text with a different watermark, so if the text later shows up on a blog, you can tell who “leaked” it.
I chose to use the spaces between words as the bits for storing the watermark, with an on bit marked by inserting a zero-width Unicode character after the space. I decided against inserting a character inside words because if the Unicode character showed up after pasting the text into a non-Unicode editor, the text would be very hard to read. Inserting between words also keeps individual words searchable in the browser. If you had the text Pe[invisible unicode character]ter in your browser and tried to search for “Pet”, your search wouldn’t match it even though it would look just like “Peter”. Of course, search terms that span a marked space are still unsearchable in my approach.
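To make the search problem concrete, here is a small sketch that uses Ruby's plain substring match as a stand-in for a browser's find-in-page. That stand-in is an assumption, since real browsers may treat zero-width characters differently, and the constant names are made up for this illustration.

```ruby
# U+FEFF, the zero-width character used as the "on" marker.
ZERO_WIDTH = "\uFEFF"

# Marker hidden inside a word: looks like "Peter" but no longer contains "Pet".
INSIDE_WORD = "Pe#{ZERO_WIDTH}ter"

# Marker placed after the space between two words: each word stays intact.
BETWEEN_WORDS = "Peter #{ZERO_WIDTH}Pan"

puts INSIDE_WORD.include?("Pet")          # false
puts BETWEEN_WORDS.include?("Peter")      # true
puts BETWEEN_WORDS.include?("Pan")        # true
puts BETWEEN_WORDS.include?("Peter Pan")  # false, the phrase spans a marked space
```

The last line is the trade-off mentioned above: single words survive, but a phrase that crosses a marked space does not.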
I tried some other approaches using different Unicode space characters, but I ran into problems with all of them. This one seems to work best in Firefox and IE. There’s a crapload of Unicode code points, so there are probably a bunch of other possibilities, for example all the alternative punctuation characters.
Currently the watermark must be an unsigned integer. It would be pretty trivial to make it work with a string.
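One way the string case might look, as a rough sketch: pack the string's bytes into a single non-negative integer before watermarking, and unpack it after reading. The helper names `string_to_integer` and `integer_to_string` are hypothetical and not part of the class in this post.

```ruby
# Pack a string's bytes into one non-negative integer (first byte most significant).
def string_to_integer(str)
  str.bytes.inject(0) { |acc, byte| (acc << 8) | byte }
end

# Reverse the packing: peel bytes off the low end until nothing is left.
def integer_to_string(int)
  bytes = []
  while int > 0
    bytes.unshift(int & 0xFF)
    int >>= 8
  end
  bytes.pack("C*").force_encoding("UTF-8")
end

id = string_to_integer("bob")
puts id                     # 6451042
puts integer_to_string(id)  # bob
```

Two caveats with this sketch: leading NUL bytes would be lost on the way back, and every byte costs up to eight spaces, so “bob” already needs 23 spaces in the carrier text.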
Here’s the usage:
irb(main):003:0> puts Watermark.apply_watermark('Here is a block of text inside of which a number will be hidden!', 42)
Here is a block of text inside of which a number will be hidden!
irb(main):004:0> Watermark.read_watermark('Here is a block of text inside of which a number will be hidden!')
=> 42
Note: the string in the code above actually contains the watermark, but you can’t see it. Try copying the text into a non-Unicode-aware context.
Just in case, I’ve also provided a method to convert the Unicode characters to HTML entities:
irb(main):011:0> Watermark.apply_watermark('Here is a block of text inside of which a number will be hidden!', 42)
=> "Here is \357\273\277a block \357\273\277of text \357\273\277inside of which a number will be hidden!"
irb(main):012:0> Watermark.escape_unicode _
=> "Here is &#xFEFF;a block &#xFEFF;of text &#xFEFF;inside of which a number will be hidden!"
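Why the markers land where they do: the bits of 42 are written least significant first, one per space. The sketch below works that out for the sample sentence; `low_bits_first` and `marked_words` are hypothetical helpers for illustration only.

```ruby
TEXT = "Here is a block of text inside of which a number will be hidden!"

# Bits of the watermark, least significant first: 42 is 101010 in binary.
def low_bits_first(number)
  (0...number.bit_length).map { |i| number[i] }
end

# Words whose trailing space carries an "on" bit (space i follows word i).
def marked_words(text, number)
  words = text.split(" ")
  bits = low_bits_first(number)
  bits.each_index.select { |i| bits[i] == 1 }.map { |i| words[i] }
end

p low_bits_first(42)      # [0, 1, 0, 1, 0, 1]
p marked_words(TEXT, 42)  # ["is", "block", "text"]
```

That matches the escaped output above, where the invisible character follows the spaces after “is”, “block” and “text”.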
Here’s the implementation:
class Watermark
  # UTF-8 bytes EF BB BF, i.e. U+FEFF (zero width no-break space).
  INVISIBLE_SPACE = "\357\273\277"
  # Index 0 encodes an off bit (plain space), index 1 an on bit (space plus marker).
  SPACE_CHARS = [" ", " #{INVISIBLE_SPACE}"]
  # The longer alternative comes first so a marked space is matched as one token.
  SPACE_REGEX = Regexp.union(SPACE_CHARS[1], SPACE_CHARS[0])

  class NotEnoughSpacesError < StandardError; end
  class BadWatermarkError < StandardError; end

  class << self
    def apply_watermark(text, watermark)
      verify_watermark_format!(watermark)
      verify_enough_spaces!(text, watermark)
      bits = bit_map(watermark)
      # Spaces left over once the bits run out stay plain (off).
      text.gsub(/ /) { SPACE_CHARS[bits.shift || 0] }
    end

    def read_watermark(watermarked_text)
      bits = watermarked_text.scan(SPACE_REGEX).map { |c| SPACE_CHARS.index(c) }
      # The first space holds the least significant bit.
      bits.each_with_index.inject(0) { |watermark, (on_off, bit)| watermark | (on_off << bit) }
    end

    def escape_unicode(text)
      # Swap each invisible marker for its numeric character reference.
      text.gsub(INVISIBLE_SPACE, "&#xFEFF;")
    end

    private

    def verify_watermark_format!(watermark)
      return if watermark.is_a?(Integer) && watermark >= 0
      raise BadWatermarkError, "only unsigned integers"
    end

    def verify_enough_spaces!(text, watermark)
      spaces_count = text.scan(/ /).size
      raise NotEnoughSpacesError if bits_needed(watermark) > spaces_count
    end

    def bits_needed(integer)
      # Smallest n with integer < 2**n; zero still takes one bit.
      [integer.bit_length, 1].max
    end

    def bit_map(integer)
      # Integer#[] returns bit i as 0 or 1, least significant first.
      (0...bits_needed(integer)).map { |bit| integer[bit] }
    end
  end
end
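To tie this back to the leak-tracking idea from the start of the post, here is a rough end-to-end sketch. It uses compact stand-ins (`embed`, `extract`) instead of the class above so that it runs on its own, and the user table is entirely made up.

```ruby
# Compact stand-ins for apply_watermark / read_watermark, so this runs by itself.
MARK = "\uFEFF"

def embed(text, number)
  bits = (0...[number.bit_length, 1].max).map { |i| number[i] }
  raise ArgumentError, "not enough spaces" if bits.size > text.count(" ")
  text.gsub(" ") { bits.shift == 1 ? " #{MARK}" : " " }
end

def extract(text)
  # A marked space is two characters long, a plain one is one character long.
  text.scan(/ #{MARK}?/).each_with_index.sum { |token, i| (token.size - 1) << i }
end

# Hypothetical user table: watermark number => user name.
USERS = { 1 => "alice", 2 => "bob", 42 => "carol" }
ORIGINAL = "Here is a block of text inside of which a number will be hidden!"

# Hand every user their own copy of the same text.
COPIES = USERS.keys.map { |id| [id, embed(ORIGINAL, id)] }.to_h

# Later, a copy turns up somewhere: read the number back and look up the user.
leaked = COPIES[42]
puts USERS.fetch(extract(leaked), "unknown")  # carol
```

An unmarked copy reads back as 0, so it is probably worth keeping 0 unassigned, as the table here does.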