There are quite a few papers on watermarking text, and most of them are pretty complex. I was trying to think of a less robust but simpler solution that could help track text being cross-posted on websites and blogs. The idea is that you provide the same block of text to each user with a different watermark, so if the text shows up later on a blog, you can tell who “leaked” it.
I chose to use the spaces between words as the bits for storing the watermark, with an on bit marked by inserting a zero-width Unicode character after the space. I decided against inserting a character inside words because if the Unicode character showed up after pasting the text into a non-Unicode editor, the text would be very hard to read. Inserting between words also keeps individual words searchable in the browser. If you had the text Pe[invisible unicode character]ter in your browser and tried to search for “Pet”, your search wouldn’t match it even though it would look just like “Peter”. Of course, search terms that contain a space can still fail to match in my approach, whenever the space they span carries an on bit.
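The search behaviour described above can be checked with plain substring matching. This is only a rough sketch: `String#include?` stands in for the browser's find feature (an assumption, since real browsers may normalise text differently), and the sample strings are made up for illustration.

```ruby
MARK = "\uFEFF" # the zero-width character used as the "on" marker

# Marker placed inside a word versus after the space between words.
inside_word   = "Pe#{MARK}ter Pan"
between_words = "Peter #{MARK}Pan"

puts inside_word.include?("Pet")          # the word itself no longer matches
puts between_words.include?("Peter")      # single words still match
puts between_words.include?("Peter Pan")  # a phrase spanning a marked space does not
```

Both strings look identical on screen; only the placement of the marker changes what a search can find.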
I tried some other approaches using different Unicode space characters, but I ran into problems with all of them. This one seems to work the best in Firefox and IE. There’s a crapload of Unicode code points, so there are probably a bunch of other possibilities, for example all the alternative punctuation characters.
Currently the watermark must be an unsigned integer. It would be pretty trivial to make it work with a string.
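One possible way to handle strings is to pack the string's bytes into a single unsigned integer and feed that to the existing integer code. The helpers `string_to_integer` and `integer_to_string` below are hypothetical names for illustration, not part of the Watermark class, and the leading sentinel byte is my own assumption about how to keep the round trip lossless.

```ruby
# Hypothetical helper: pack a string's bytes into one non-negative integer.
def string_to_integer(str)
  # Prefix a 0x01 byte so leading NUL bytes are not lost as leading zeros.
  ("\x01".b + str.b).unpack1("H*").to_i(16)
end

# Hypothetical helper: reverse of string_to_integer.
def integer_to_string(int)
  hex = int.to_s(16)
  hex = "0#{hex}" if hex.length.odd? # restore the zero nibble dropped by to_s
  bytes = [hex].pack("H*")
  bytes[1..-1].force_encoding(Encoding::UTF_8) # drop the sentinel byte
end

number = string_to_integer("bob")
puts number.bit_length          # spaces the carrier text would need
puts integer_to_string(number)  # back to the original string
```

The cost is eight spaces per byte plus one for the sentinel, so this only suits short tags in reasonably long texts.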
Here’s the usage:
    irb(main):003:0> puts Watermark.apply_watermark('Here is a block of text inside of which a number will be hidden!', 42)
    Here is a block of text inside of which a number will be hidden!
    irb(main):004:0> Watermark.read_watermark('Here is a block of text inside of which a number will be hidden!')
    => 42
Note: the string in the above code actually contains the watermark, but you don’t see it. Try copying the text into a non-Unicode-aware context.
Just in case, I’ve also provided a method to convert the Unicode characters to HTML entities:
    irb(main):011:0> Watermark.apply_watermark('Here is a block of text inside of which a number will be hidden!', 42)
    => "Here is \357\273\277a block \357\273\277of text \357\273\277inside of which a number will be hidden!"
    irb(main):012:0> Watermark.escape_unicode _
    => "Here is &#xFEFF;a block &#xFEFF;of text &#xFEFF;inside of which a number will be hidden!"
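The same escaping idea can be applied to every non-ASCII character rather than just the marker, which helps when inspecting text of unknown origin. A minimal sketch, assuming a hypothetical helper named `escape_non_ascii` that is not part of the Watermark class:

```ruby
# Hypothetical helper: replace each non-ASCII character with an HTML
# numeric character reference such as &#xFEFF;.
def escape_non_ascii(text)
  text.gsub(/[^[:ascii:]]/) { |char| format("&#x%X;", char.ord) }
end

puts escape_non_ascii("Here is \uFEFFa block")
```

Any hidden marker then shows up as a visible entity in the output, while plain ASCII text passes through unchanged.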
Here’s the implementation:
    class Watermark
      INVISIBLE_SPACE = "\uFEFF" # U+FEFF, zero-width no-break space
      # Index 0 is an "off" bit (plain space); index 1 is an "on" bit (space + marker).
      SPACE_CHARS = [" ", " #{INVISIBLE_SPACE}"].freeze
      # Try the marked space first so it is not matched as a plain space.
      SPACE_REGEX = Regexp.union(SPACE_CHARS[1], SPACE_CHARS[0])

      class NotEnoughSpacesError < StandardError; end
      class BadWatermarkError < StandardError; end

      class << self
        # Hide a non-negative integer in the spaces of text, least significant bit first.
        def apply_watermark(text, watermark)
          verify_watermark_format!(watermark)
          verify_enough_spaces!(text, watermark)
          bits = bit_map(watermark)
          text.gsub(/ /) { SPACE_CHARS[bits.shift || 0] }
        end

        # Recover the integer by reading each space as one bit.
        def read_watermark(watermarked_text)
          bits = watermarked_text.scan(SPACE_REGEX).map { |c| SPACE_CHARS.index(c) }
          bits.each_with_index.inject(0) { |watermark, (on_off, bit)| watermark | (on_off << bit) }
        end

        # Replace the invisible character with its HTML numeric character reference.
        def escape_unicode(text)
          text.gsub(INVISIBLE_SPACE, "&#xFEFF;")
        end

        private

        def verify_watermark_format!(watermark)
          return if watermark.is_a?(Integer) && watermark >= 0
          raise BadWatermarkError, "only unsigned integers"
        end

        def verify_enough_spaces!(text, watermark)
          spaces_count = text.scan(/ /).size
          raise NotEnoughSpacesError if bits_needed(watermark) > spaces_count
        end

        # Smallest bit count with integer < 2**bits (at least 1, so 0 still uses a bit).
        def bits_needed(integer)
          [integer.bit_length, 1].max
        end

        def bit_map(integer)
          Array.new(bits_needed(integer)) { |bit| integer[bit] }
        end
      end
    end
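To tie this back to the leak-tracking idea, here is a rough sketch of handing each user a differently marked copy and looking up the owner of a leaked one. So that it runs on its own, `embed` and `extract` are simplified stand-ins for `apply_watermark` and `read_watermark` (no validation), and the `USERS` table is made up for illustration.

```ruby
MARK = "\uFEFF"

# Simplified stand-in for apply_watermark: mark the gaps whose bit is set.
def embed(text, number)
  bit = -1
  text.gsub(" ") { number[bit += 1] == 1 ? " #{MARK}" : " " }
end

# Simplified stand-in for read_watermark: sum the bits of the marked gaps.
def extract(text)
  text.scan(/ #{MARK}?/).each_with_index.sum { |gap, i| gap.end_with?(MARK) ? 1 << i : 0 }
end

# Hypothetical user table: watermark ID => account name.
USERS = { 1 => "alice", 2 => "bob", 3 => "carol" }

ORIGINAL = "Here is a block of text inside of which a number will be hidden!"
copies = USERS.keys.map { |id| [id, embed(ORIGINAL, id)] }.to_h

leaked = copies[2] # pretend this copy turned up on a blog
puts USERS.fetch(extract(leaked), "unknown")
```

IDs start at 1 here because a watermark of 0 sets no bits and so looks exactly like unmarked text.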