Channel: coderrr » security
Simple text watermarking with Unicode

There are quite a few papers on watermarking text, and most of the schemes are pretty complex. I was trying to think of a less robust but simpler solution that could help track text being cross-posted on websites and blogs. The idea is that you provide the same block of text with a different watermark to each user, so if the text shows up later on a blog, you can tell who “leaked” it.

I chose to use the spaces between words as the bits for storing the watermark, with an on bit marked by inserting a zero-width Unicode character after the space. I decided against inserting a character inside words because if the Unicode character showed up after pasting the text into a non-Unicode editor, the text would be very hard to read. Inserting between words also keeps individual words searchable in the browser. If you had the text Pe[invisible unicode character]ter in your browser and tried to search for “Pet”, your search wouldn’t match it even though it would look just like “Peter”. Of course, search terms that contain spaces can still fail to match in my approach, whenever they span a marked space.
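
To make the search problem concrete, here is a small Ruby sketch. It is not part of the original code: the constant names are made up for illustration, and the `"\uFEFF"` escape assumes Ruby 1.9 or later. Plain substring matching stands in for the browser's find feature, which is an assumption, since a browser may treat invisible characters differently.

```ruby
ZERO_WIDTH = "\uFEFF"  # the same code point the watermark uses

INSIDE_WORD   = "Pe#{ZERO_WIDTH}ter"        # marker inside a word
BETWEEN_WORDS = "Hello #{ZERO_WIDTH}Peter"  # marker after a space

puts INSIDE_WORD.include?("Pet")            # false: the word itself is broken up
puts BETWEEN_WORDS.include?("Pet")          # true: single words still match
puts BETWEEN_WORDS.include?("Hello Peter")  # false: the phrase spans a marked space
```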

I tried some other approaches using different Unicode space characters, but I ran into problems with all of them. This one seems to work the best in Firefox and IE. There’s a crapload of Unicode code points, so there are probably a bunch of other possibilities, for example all the alternative punctuation characters.

Currently the watermark must be an unsigned integer. It would be pretty trivial to make it work with a string.
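
One way the string case might work is to convert the string to an integer before watermarking and back again after reading. The helpers below are hypothetical (`string_to_watermark` and `watermark_to_string` are not part of the Watermark class), and this is only a sketch: leading NUL bytes and the empty string do not survive the round trip, and each byte of the string costs roughly eight spaces of cover text.

```ruby
# Hypothetical helpers, not part of the Watermark class.

# Interpret the string's bytes as one big non-negative integer.
def string_to_watermark(str)
  str.unpack("H*").first.to_i(16)
end

# Reverse the conversion: integer -> hex digits -> bytes -> string.
def watermark_to_string(integer)
  hex = integer.to_s(16)
  hex = "0#{hex}" if hex.length.odd?  # pack("H*") expects whole bytes
  [hex].pack("H*").force_encoding("UTF-8")
end

number = string_to_watermark("bob")
puts number                       # 6451042
puts watermark_to_string(number)  # bob
```

The resulting integer can be passed to apply_watermark as usual, provided the text has enough spaces to hold it.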

Here’s the usage:

irb(main):003:0> puts Watermark.apply_watermark('Here is a block of text inside of which a number will be hidden!', 42)
Here is a block of text inside of which a number will be hidden!
irb(main):004:0> Watermark.read_watermark('Here is a block of text inside of which a number will be hidden!')
=> 42

Note: the string in the above code actually contains the watermark, but you don’t see it. Try copying the text into a non-Unicode-aware context.

Just in case, I’ve also provided a method to convert the Unicode characters to HTML entities:

irb(main):011:0> Watermark.apply_watermark('Here is a block of text inside of which a number will be hidden!', 42)
=> "Here is \357\273\277a block \357\273\277of text \357\273\277inside of which a number will be hidden!"
irb(main):012:0> Watermark.escape_unicode _
=> "Here is &#xFEFF;a block &#xFEFF;of text &#xFEFF;inside of which a number will be hidden!"
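
The marker positions in that output follow from the binary form of 42. As a rough illustration (a sketch written for this explanation rather than taken from the class; the constant names are made up, and `Integer#bit_length` assumes Ruby 2.1 or later), the bits are consumed least significant first, one per space:

```ruby
WATERMARK = 42
TEXT = "Here is a block of text inside of which a number will be hidden!"

# Bits of 42 (binary 101010), least significant first.
BITS = (0...WATERMARK.bit_length).map { |i| WATERMARK[i] }
p BITS    # [0, 1, 0, 1, 0, 1]

# Space number i sits between word i and word i + 1, so a set bit i
# puts the invisible character right in front of word i + 1.
words = TEXT.split(" ")
MARKED = BITS.each_index.select { |i| BITS[i] == 1 }.map { |i| words[i + 1] }
p MARKED  # ["a", "of", "inside"]
```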

Here’s the implementation:

class Watermark
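  INVISIBLE_SPACE = "\357\273\277"  # UTF-8 bytes of U+FEFF (zero width no-break space)
  # The index in this array is the bit value: 0 = plain space, 1 = marked space.
  SPACE_CHARS = [ " ", " #{INVISIBLE_SPACE}" ]
  # The marked space goes first so it is matched in preference to a plain space.
  SPACE_REGEX = Regexp.union(SPACE_CHARS[1], SPACE_CHARS[0])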

  class NotEnoughSpacesError < StandardError; end
  class BadWatermarkError < StandardError; end

  class << self
    def apply_watermark(text, watermark)
      verify_watermark_format!(watermark)
      verify_enough_spaces!(text, watermark)

      bits = bit_map(watermark)
      text.gsub(/ /) { SPACE_CHARS[bits.shift || 0] }
    end

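    def read_watermark(watermarked_text)
      # Each space decodes to its index in SPACE_CHARS: 0 (plain) or 1 (marked).
      bits = watermarked_text.scan(SPACE_REGEX).map { |c| SPACE_CHARS.index(c) }

      # Bits are stored least significant first.
      bits.each_with_index.inject(0) { |watermark, (bit, i)| watermark | (bit << i) }
    end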

    def escape_unicode(text)
      text.gsub(INVISIBLE_SPACE, "&#xFEFF;")
    end

    private

    def verify_watermark_format!(watermark)
      raise(BadWatermarkError, "only unsigned integers")  unless watermark.is_a?(Integer) && watermark >= 0
    end

    def verify_enough_spaces!(text, watermark)
      spaces_count = text.scan(/ /).size
      raise NotEnoughSpacesError  if bits_needed(watermark) > spaces_count
    end

    def bits_needed(integer)
      # Integer#bit_length (Ruby 2.1+) avoids the floating-point rounding of
      # Math.log for very large values; zero still occupies one bit.
      [integer.bit_length, 1].max
    end

    def bit_map(integer)
      Array.new(bits_needed(integer)) {|bit| integer[bit] }  # least significant bit first
    end
  end
end
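
One consequence of storing one bit per space is that the length of the text caps the watermark. A quick sketch of that limit, which is what verify_enough_spaces! enforces (the `max_watermark` helper is hypothetical, added only for illustration, and assumes the text has at least one space):

```ruby
# Hypothetical helper: largest integer that fits when each space holds one bit.
def max_watermark(text)
  spaces = text.count(" ")
  (1 << spaces) - 1
end

text = "Here is a block of text inside of which a number will be hidden!"
puts text.count(" ")      # 13
puts max_watermark(text)  # 8191
```

So the sample sentence can distinguish 8192 users at most; longer watermarks need longer text.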

