Accumulo has an iterator called the intersecting iterator which supports querying a term index that is partitioned by document, or "sharded". This example shows how to use the intersecting iterator through these four programs:
- Index.java - Indexes a set of text files into an Accumulo table
- Query.java - Finds documents containing a given set of terms.
- Reverse.java - Reads the index table and writes a map of documents to terms into another table.
- ContinuousQuery.java - Uses the table populated by Reverse.java to select N random terms per document. Then it continuously and randomly queries those terms.
To run these example programs, create two tables like below.
username@instance> createnamespace examples
username@instance> createtable examples.shard
username@instance examples.shard> createtable examples.doc2term
After creating the tables, index some files. The following command indexes all the java files in the Accumulo source code.
$ find /path/to/accumulo/core -name "*.java" | xargs ./bin/runex shard.Index -t examples.shard --partitions 30
The following command queries the index to find all files containing 'foo' and 'bar'.
$ ./bin/runex shard.Query -t examples.shard foo bar
/path/to/accumulo/core/src/test/java/org/apache/accumulo/core/spi/balancer/BaseHostRegexTableLoadBalancerTest.java
/path/to/accumulo/core/src/test/java/org/apache/accumulo/core/iterators/user/WholeRowIteratorTest.java
/path/to/accumulo/core/src/test/java/org/apache/accumulo/core/iteratorsImpl/IteratorConfigUtilTest.java
/path/to/accumulo/core/src/test/java/org/apache/accumulo/core/data/KeyBuilderTest.java
/path/to/accumulo/core/src/test/java/org/apache/accumulo/core/spi/balancer/HostRegexTableLoadBalancerReconfigurationTest.java
/path/to/accumulo/core/src/test/java/org/apache/accumulo/core/security/ColumnVisibilityTest.java
/path/to/accumulo/core/src/test/java/org/apache/accumulo/core/summary/SummaryCollectionTest.java
/path/to/accumulo/core/src/test/java/org/apache/accumulo/core/spi/balancer/HostRegexTableLoadBalancerTest.java
/path/to/accumulo/core/src/test/java/org/apache/accumulo/core/client/IteratorSettingTest.java
/path/to/accumulo/core/src/test/java/org/apache/accumulo/core/data/KeyExtentTest.java
/path/to/accumulo/core/src/test/java/org/apache/accumulo/core/security/VisibilityEvaluatorTest.java
/path/to/accumulo/core/src/test/java/org/apache/accumulo/core/iterators/user/TransformingIteratorTest.java
/path/to/accumulo/core/src/test/java/org/apache/accumulo/core/client/admin/NewTableConfigurationTest.java
/path/to/accumulo/core/src/test/java/org/apache/accumulo/core/conf/HadoopCredentialProviderTest.java
/path/to/accumulo/core/src/test/java/org/apache/accumulo/core/clientImpl/TableOperationsHelperTest.java
/path/to/accumulo/core/src/test/java/org/apache/accumulo/core/iterators/user/WholeColumnFamilyIteratorTest.java
In order to run ContinuousQuery, we need to run Reverse.java to populate the examples.doc2term
table.
$ ./bin/runex shard.Reverse --shardTable examples.shard --doc2Term examples.doc2term
Below ContinuousQuery is run using 5 terms. So it selects 5 random terms from each document, then it continually randomly selects one set of 5 terms and queries. It prints the number of matching documents and the time in seconds.
$ ./bin/runex shard.ContinuousQuery --shardTable examples.shard --doc2Term examples.doc2term --terms 5
[string, protected, sizeopt, cache, build] 1 0.084
[public, these, exception, to, as] 25 0.267
[by, encodeprevendrow, 0, work, as] 4 0.056
[except, to, a, limitations, one] 969 0.197
[copy, as, asf, version, is] 969 0.341
[core, class, may, regarding, without] 862 0.437
[max_data_to_print, default_visibility_cache_size, use, accumulo_export_info, fate] 1 0.066