org.archive.util
Class BloomFilter32bp2

java.lang.Object
  extended by org.archive.util.BloomFilter32bp2
All Implemented Interfaces:
java.io.Serializable, BloomFilter

public class BloomFilter32bp2
extends java.lang.Object
implements java.io.Serializable, BloomFilter

A Bloom filter. SLIGHTLY ADAPTED VERSION OF MG4J it.unimi.dsi.mg4j.util.BloomFilter

KEY CHANGES:


Instances of this class represent a set of character sequences (with false positives) using a Bloom filter. Because of the way Bloom filters work, you cannot remove elements.

Bloom filters have an expected error rate that depends on the number of hash functions used, on the filter size and on the number of elements in the filter. This implementation uses a variable optimal number of hash functions, depending on the expected number of elements. More precisely, a Bloom filter for n character sequences with d hash functions will use dn / ln 2 ≈ 1.44 dn bits; false positives will happen with probability 2^-d.
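
The sizing arithmetic above can be sketched as follows. This is a minimal illustration, not code from org.archive.util: the class BloomSizing and its method names are hypothetical, and the rounding up to a power of two is an assumption based on the description of the power field.

```java
/** Sizing arithmetic for a Bloom filter; all names here are illustrative only. */
final class BloomSizing {
    private BloomSizing() {}

    /** Minimum number of bits for n elements and d hash functions: d*n / ln 2. */
    static long minBits(int n, int d) {
        return (long) Math.ceil((double) n * d / Math.log(2));
    }

    /** Smallest power of two that is >= bits (assumed rounding, see lead-in). */
    static long roundUpToPowerOfTwo(long bits) {
        long p = 1L;
        while (p < bits) {
            p <<= 1;
        }
        return p;
    }

    /** Nominal false-positive probability 2^-d when at most n elements are added. */
    static double falsePositiveRate(int d) {
        return Math.pow(2.0, -d);
    }

    public static void main(String[] args) {
        int n = 1000, d = 10;
        long bits = minBits(n, d);
        System.out.println("minimum bits   = " + bits);
        System.out.println("power of two   = " + roundUpToPowerOfTwo(bits));
        System.out.println("false positive = " + falsePositiveRate(d));
    }
}
```

For n = 1000 and d = 10 the minimum is 14427 bits, which rounds up to 16384, and the nominal false-positive probability is 1/1024.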

Hash functions are generated at creation time using universal hashing. Each hash function uses NUMBER_OF_WEIGHTS random integers, which are cyclically multiplied by the character codes in a character sequence. The resulting integers are XOR-ed together.
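
One way to read the hashing scheme described above is sketched here. This is an assumption-laden illustration, not the class's actual code: WeightedXorHash is a hypothetical name, the value 16 for NUMBER_OF_WEIGHTS is a placeholder (the real constant's value is not stated on this page), and java.util.Random stands in for whatever random source the implementation uses.

```java
import java.util.Random;

/** One hash function built from random weights; illustrative sketch only. */
final class WeightedXorHash {
    // Placeholder value; the real NUMBER_OF_WEIGHTS is not given on this page.
    static final int NUMBER_OF_WEIGHTS = 16;

    private final int[] weights = new int[NUMBER_OF_WEIGHTS];

    /** Draws the weights once, at creation time. */
    WeightedXorHash(Random random) {
        for (int i = 0; i < weights.length; i++) {
            weights[i] = random.nextInt();
        }
    }

    /** Multiplies each character code by a weight, cycling through the weights, and XORs the products. */
    int hash(CharSequence s) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) {
            h ^= s.charAt(i) * weights[i % NUMBER_OF_WEIGHTS];
        }
        return h;
    }
}
```

A filter with d hash functions would hold d such objects, each with its own independently drawn weights, and reduce each hash value to a bit index in the range 0 to m - 1.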

This class exports access methods that are very similar to those of Set, but it does not implement that interface, as too many non-optional methods would be unimplementable (e.g., iterators).

Author:
Sebastiano Vigna
See Also:
Serialized Form

Field Summary
protected static int ADDRESS_BITS_PER_UNIT
           
protected static int BIT_INDEX_MASK
           
 int d
          The number of hash functions used by this filter.
 long m
          The number of bits in this filter.
static int NUMBER_OF_WEIGHTS
          The number of weights used to create hash functions.
 long power
          The power of two that m equals.
 
Constructor Summary
BloomFilter32bp2(int n, int d)
          Creates a new Bloom filter with given number of hash functions and expected number of elements.
 
Method Summary
 boolean add(java.lang.CharSequence s)
          Adds a character sequence to the filter.
 boolean contains(java.lang.CharSequence s)
          Checks whether the given character sequence is in this filter.
protected  boolean getBit(int bitIndex)
          Returns from the local bitvector the value of the bit with the specified index.
 long getSizeBytes()
          The amount of memory in bytes consumed by the bloom bitfield.
protected  void setBit(int bitIndex)
          Sets the bit with index bitIndex in the local bitvector.
 int size()
          The number of character sequences in the filter.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

NUMBER_OF_WEIGHTS

public static final int NUMBER_OF_WEIGHTS
The number of weights used to create hash functions.

See Also:
Constant Field Values

m

public final long m
The number of bits in this filter.


power

public final long power
The power of two that m equals.


d

public final int d
The number of hash functions used by this filter.


ADDRESS_BITS_PER_UNIT

protected static final int ADDRESS_BITS_PER_UNIT
See Also:
Constant Field Values

BIT_INDEX_MASK

protected static final int BIT_INDEX_MASK
See Also:
Constant Field Values
Constructor Detail

BloomFilter32bp2

public BloomFilter32bp2(int n,
                        int d)
Creates a new Bloom filter with given number of hash functions and expected number of elements.

Parameters:
n - the expected number of elements.
d - the number of hash functions; if no more than n elements are added to the filter, false positives will happen with probability 2^-d.
Method Detail

size

public int size()
The number of character sequences in the filter.

Specified by:
size in interface BloomFilter
Returns:
the number of character sequences in the filter (but see contains(CharSequence)).

contains

public boolean contains(java.lang.CharSequence s)
Checks whether the given character sequence is in this filter.

Note that this method may return true on a character sequence that has not been added to the filter. This will happen with probability 2^-d, where d is the number of hash functions specified at creation time, provided the number of elements in the filter is less than n, the expected number of elements specified at creation time.

Specified by:
contains in interface BloomFilter
Parameters:
s - a character sequence.
Returns:
true if the sequence is in the filter (or if a sequence with the same hash sequence is in the filter).

add

public boolean add(java.lang.CharSequence s)
Adds a character sequence to the filter.

Specified by:
add in interface BloomFilter
Parameters:
s - a character sequence.
Returns:
true if the character sequence was not in the filter (but see contains(CharSequence)).
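
The add/contains contract documented above can be illustrated with a small self-contained filter. This is a sketch under stated assumptions, not this class's implementation: TinyBloomFilter, its seed parameter, the 16 weights per hash function and the use of java.util.BitSet are all hypothetical choices, and the rule that size() counts only the calls to add that returned true is an assumption.

```java
import java.util.BitSet;
import java.util.Random;

/** Minimal Bloom filter showing add/contains/size semantics; illustrative only. */
final class TinyBloomFilter {
    private final BitSet bits;
    private final int m;            // number of bits in the filter
    private final int[][] weights;  // one weight array per hash function
    private int size;               // adds that changed at least one bit

    TinyBloomFilter(int n, int d, long seed) {
        this.m = Math.max(1, (int) Math.ceil((double) n * d / Math.log(2)));
        this.bits = new BitSet(m);
        this.weights = new int[d][16]; // 16 weights per function is a placeholder
        Random random = new Random(seed);
        for (int[] w : weights) {
            for (int i = 0; i < w.length; i++) {
                w[i] = random.nextInt();
            }
        }
    }

    /** Bit index chosen by the k-th hash function for s. */
    private int index(CharSequence s, int k) {
        int[] w = weights[k];
        int h = 0;
        for (int i = 0; i < s.length(); i++) {
            h ^= s.charAt(i) * w[i % w.length];
        }
        return Math.floorMod(h, m);
    }

    /** True if every bit for s is set; may be a false positive. */
    boolean contains(CharSequence s) {
        for (int k = 0; k < weights.length; k++) {
            if (!bits.get(index(s, k))) {
                return false;
            }
        }
        return true;
    }

    /** Sets every bit for s; returns true if at least one bit was previously clear. */
    boolean add(CharSequence s) {
        boolean changed = false;
        for (int k = 0; k < weights.length; k++) {
            int i = index(s, k);
            if (!bits.get(i)) {
                bits.set(i);
                changed = true;
            }
        }
        if (changed) {
            size++;
        }
        return changed;
    }

    int size() {
        return size;
    }
}
```

With this sketch, the first add of a sequence to an empty filter returns true, and adding the same sequence again returns false. A false return from add for a sequence that was never added is the same false-positive case that contains describes.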

getBit

protected boolean getBit(int bitIndex)
Returns the value of the bit with the specified index in the local bitvector: true if the bit with index bitIndex is currently set, false otherwise. (adapted from cern.colt.bitvector.QuickBitVector)

Parameters:
bitIndex - the bit index.
Returns:
the value of the bit with the specified index.

setBit

protected void setBit(int bitIndex)
Sets the bit with index bitIndex in the local bitvector. (adapted from cern.colt.bitvector.QuickBitVector)

Parameters:
bitIndex - the index of the bit to be set.
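
How getBit and setBit might address bits in an int array is sketched below. The class IntBitVector is hypothetical, and the constant values 5 and 31 are assumptions inferred from the "32" in the class name (32-bit units); this page does not state the actual values.

```java
/** Bit vector backed by 32-bit ints; illustrative sketch only. */
final class IntBitVector {
    static final int ADDRESS_BITS_PER_UNIT = 5; // assumed: 32 = 2^5 bits per int
    static final int BIT_INDEX_MASK = 31;       // assumed: bits per unit minus 1

    private final int[] units;

    /** Allocates enough ints to hold nBits bits, all initially clear. */
    IntBitVector(int nBits) {
        units = new int[(nBits + BIT_INDEX_MASK) >>> ADDRESS_BITS_PER_UNIT];
    }

    /** The high bits of bitIndex select the int, the low 5 bits select the bit within it. */
    boolean getBit(int bitIndex) {
        int unit = units[bitIndex >>> ADDRESS_BITS_PER_UNIT];
        return (unit & (1 << (bitIndex & BIT_INDEX_MASK))) != 0;
    }

    /** Sets the addressed bit to 1; other bits are left unchanged. */
    void setBit(int bitIndex) {
        units[bitIndex >>> ADDRESS_BITS_PER_UNIT] |= 1 << (bitIndex & BIT_INDEX_MASK);
    }
}
```

Using a shift and a mask instead of division and modulo works because the unit size is a power of two.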

getSizeBytes

public long getSizeBytes()
Description copied from interface: BloomFilter
The amount of memory in bytes consumed by the bloom bitfield.

Specified by:
getSizeBytes in interface BloomFilter
Returns:
memory used by bloom bitfield, in bytes


Copyright © 2003-2008 Internet Archive. All Rights Reserved.