Luke - Lucene Index Toolbox

Important note: Luke is now hosted at Google Code: http://code.google.com/p/luke/ and you should go to that page to obtain the latest release of Luke. This page contains only information about past releases and is no longer up to date.

Lucene is an Open Source, mature and high-performance Java search engine. It is highly flexible, and scalable from hundreds to millions of documents.

Luke is a handy development and diagnostic tool, which accesses already existing Lucene indexes and allows you to display and modify their content in several ways:

browse by document number, or by term
view documents / copy to clipboard
retrieve a ranked list of most frequent terms
execute a search, and browse the results
analyze search results
selectively delete documents from the index
reconstruct the original document fields, edit them and re-insert to the index
optimize indexes
and much more...

Recent versions of Luke are also extensible through plugins and scripting.

I started this project because I needed a tool like this. I decided to distribute it under Open Source license to express my gratitude to the Lucene team for creating such a high-quality product. Lucene is one of the landmark proofs that Open Source paradigm can result in high-quality and free products.

Download - source and binary

See above - this version information is outdated!!!

Current version is 0.9.9, released on September 30, 2009.

It uses the official Lucene 2.9.0 release JARs.

NOTE: Luke requires now Java 1.5 or higher.

You can download the binary JARs here:

A standalone full JAR, containing Luke, Lucene, Rhino JavaScript, plugins and additional analyzers (~7MB):
lukeall.jar
There are no external dependencies. This version can be run simply by java -jar lukeall.jar, or double-click in Windows.
A standalone minimal JAR, containing Luke and Lucene (~850kB):
lukemin.jar
There are no external dependencies. This version can be run simply by java -jar lukemin.jar, or double-click in Windows.
As a separate JAR, one containing Luke - NOTE that you need to supply Lucene jars on the classpath:
luke.jar (~120kB)
Again, remember to put at least the three required JARs on your classpath, e.g.: java -classpath luke.jar;lucene.jar;lucene-misc.jar org.getopt.luke.Luke

You can download the source code ZIP (2MB): luke-src-0.9.9.zip
You can download the source code TGZ (2MB): luke-src-0.9.9.tgz

License

Luke is covered by Apache Software License, which means that it's free for any use, including commercial use. It comes with full source code included (see section above). Notice however that the Thinlet library is covered by GNU Library (Lesser) Public License, which puts different restrictions on that portion of the program.

If you feel inclined, I would appreciate a short email note, in case you find this program useful, or if you want to redistribute it in a software collection. Although it's not required by the license, it gives me some idea of how people use it, and what features are most useful to them...

Bug reports

Hopefully, there will be none! :-) Ok, let's be realistic... if you notice a bug, or if you come up with a useful feature request, or even better - with patches that implement new functionality - please contact the author (Andrzej Bialecki,

). Thank you in advance for your comments and contributions!

Changes in v. 0.9.9 (released on 2009.09.30):

This release upgrades to Lucene 2.9.0 jars.

New features and improvements:
- Luke can now open multiple indexes found in subdirectories one level below the selected directory.
- Added Hadoop Plugin for working with indexes on Hadoop file systems. This uses Hadoop 0.19.2 and dependencies released with this release of Hadoop. The plugin uses partial local caching to speed up some operations.
- Term counts and percentages are calculated in a background thread, so the opening of large indexes should be a little faster. Also, this operation is skipped for indexes accessed over slow IO (such as HDFS).
- Added More Like This query builder from current document (or its selected fields).
- Search time is now monitored using System.nanoTime, and the last search time is preserved in the UI.
- Search can be now repeated N times to get a better estimate of average search time. Note: the measured time involves only the search(), not the retrieval of documents.
- AnalyzerTool plugin now uses and illustrates the new token Attribute API.
- Nearly all uses of deprecated Lucene API are replaced with the new API.
Bug fixes:
- Fix a counter-intuitive behavior where in the Open dialog Luke chops off the last path element from previously used index path.
- Fix XMLExporter entity escaping, and add a missing quote in term vector size.
- Fix a long-standing Thinlet bug related to tabbedpane with many tabs - now tabs don't "overflow" the tabbedpane area thus corrupting the display of surrounding components.

I'd like to thank people who reported problems and suggested improvements: Craig Stires, Phil Whelan, Andrea Habringer, Benjamin Beckmann, George Herson, Daniel Noll, Mike McCandless, Chris Pimlott and others.

Changes in v. 0.9.2 (released on 2009.03.20):

This release upgrades to Lucene 2.4.1 jars.

New features and improvements:
- Added term counts per field in Overview - contributed by Mark Harwood.
- Improved the Analysis plugin to show all token information, and highlight whenever a token is selected from the list.
Bug fixes:
- (None)

Changes in v. 0.9.1 (released on 2008.11.22):

This is mostly a bug fix release of 0.9.

New features and improvements:
- Added ability to set the maximum count of boolean clauses in BooleanQuery.
Bug fixes:
- Unbalanced <commit> tags breaking the XML export. Reported by Teruhiko Kurosaka.
- Opening a non-existent index from command-line creates an empty directory. Reported by Chris Pimlott. See also LUCENE-1464.
- IndexGate inadvertently deleting previous commit points, even if "Keep all commits" option was specified. Reported by Mark Harwood.
- Empty index with no fields was reported as invalid. Discovered by Andrew Zhang and Michael McCandless (LUCENE-1454).

Changes in v. 0.9 (released on 2008.11.15):

This release adds many functionality enhancements and advanced features available in Lucene 2.4.

New features and improvements:
- Added new tools:
  - Check Index - checks Lucene indexes for problems, and can fix some of them. This is a GUI front-end to the Lucene CheckIndex tool.
  - Export to XML - exports index data and metadata to XML file. This is available both from the GUI and from the command-line.
- Significantly improved Optimize and Cleanup tools.
- Added ability to set norms on any indexed field in a document, or a range of documents.
- Delete multiple documents by specifying ranges of document numbers.
- Added support for new field functionality: omitTF and binary fields.
- Improve the low-level information about the index, including format version.
- Show interesting details about IndexCommit points and associated files.
- Add short explanations of index files' functions.
- Improve document reconstruction - now the information from TermFreqVector can be used if available. Also, DocReconstructor can be used outside of Luke.
- Significantly improved advanced search options - QueryParser settings, Similarity and HitCollector settings.
- Read-only functionality is supported directly in IndexReader.
Bug fixes:

A lot of effort went into refactoring the code, moving away if possible from the spaghetti code influenced by Thinlet and into a modular design. Still much needs to be done here. :(

This means that there are likely many more bugs than in the previous release, although I tested all functionality to make sure that there is no data loss.

HOWEVER, if you work with precious data, it's always a good idea to use the "Read-only" option.

Changes in v. 0.8.1 (released on 2008.02.13):

This release adds some functionality enhancements related to TermVectors and Payloads.

New features and improvements:
- When editing document fields it's now possible to specify TermVectors with offsets and/or positions.
- Added ability to show term vector positions and offsets, if available. It's also possible to copy this list to the clipboard.
- Added ability to show term positions within a document, and display term payloads if available, using one of several pre-defined payload decoders. It's also possible to copy this list to the clipboard.
- It's possible now to view the full content of a stored field using various content decoders (hex, date / time, number, utf8, arrays of int or float)
- Layout of "Browse by Term" panel is changed so that it better reflects the available navigation.
Bug fixes:
- Check added to prevent from adding new documents if no index is open.
- Wrong class was used in IndexGate to represent deletable files, which caused a ClassCastException.
- Some query types may have been skipped when displaying Explanation.

Changes in v. 0.8 (released on 2008.02.04):

This release upgrades to the official Lucene 2.3.0 release JARs.

NOTE: this version of Luke requires Java 1.5 or higher.

The following changes have been made in this release:

New features and improvements:
- Added ability to show full text of a field in a popup dialog, both in plain text and as a hexadecimal dump.
- It's also possible to save the content of a single field to an external file. This is useful for saving binary fields, or examining exact byte content of a field.
- Added an option to load the index to RAMDirectory. NOTE: obviously you should take into account the index size vs. the available heap size ... ;)
- GrowableStringArray is a separate public class now - perhaps some day I'll use it to implement a bulk document reconstruct function.
- Luke remembers now the last Analyzer and last field used in previous session.
Bug fixes:
- Neither the document nor the field details have the "Boost" column anymore, it's always 1.0f in documents retrieved from an index. Instead this column now reads "Norms" and shows the fieldNorm value of a field.

Changes in v. 0.7.1 (released on 2007.06.20):

This minor release is mostly an upgrade to the official Lucene 2.2.0 release JARs.

The following changes have been made in this release:

New features and improvements:
Bug fixes:
- Fixed IndexGate class to correctly show deletable files.

Changes in v. 0.7 (released on 2007.02.20):

This release uses the official Lucene 2.1.0 release JARs.

The following changes have been made in this release:

New features and improvements:
- Added pagination of results, especially useful for very large result sets.
- Added support for new Field flags. They are now displayed in the Document details, and also can be set on edited documents.
- Added a function to add new documents to the index.
- Low-level index operations (such as detecting unused files, index directory cleanup) use the newly exposed Lucene classes instead of duplicating their internals in Luke.
- A side-effect of the above is the ability to properly cleanup all supported index formats, including the new lockless and single-norm indexes.
- Added a function to copy the list of top terms to clipboard.
- Added a function to copy the term vector to clipboard.
- Added a function to close and/or re-open the current index.
- In the Documents tab, pressing "First Term" now positions the term enumeration at the first term for the selected field.
- Added a field vocabulary analysis plugin by Mark Harwood, with some modifications.
- Overall UI cleanup - improved layout in some places, added graphics instead of ASCII art, etc.
Bug fixes:
- Fixed a bug in index size calculation.
- Fixed a bug in term browsing - when "First Term" was pressed in reality the second term was shown.

The following people contributed patches, suggestions, and generally kept prodding me and poking to produce this release: Volodymyr Bychkoviak, Juan Manuel Caicedo, Mark Harwood, Otis Gospodnetic, Benson Margulies, Jean-Philippe Robichaud, and many, many others. Thank you for your support!

Changes in v. 0.6:

The most important addition is the scripting framework based on Mozilla Rhino JavaScript engine. Additional plugins and functions were added, as follows:

The query view shows not only a parsed form, but also a re-written query form.
Query Structure shows internal structure of a query. Explanations are provided both for the parsed and rewritten queries.
Command-line argument parsing. Now you can open an index on startup, and optionally execute a script (see below).
Custom Similarity designer plugin, which allows you to design and test your own Similarity implementation.
Scripting plugin, which allows you to interactively experiment with Luke and Lucene indexes. This plugin also can run scripts from Luke command-line.
Ability to use MMapDirectory. Due to limitations in Lucene API this feature relies on reflection API, and may sometimes fail if a restrictive SecurityManager is in use. The Overview panel shows which Directory implementation is used.
Proper display of overlapping tokens (created with analyzers that use setPositionIncrement(0) ).
Field names are sorted in alphabetical order.
Characters in field values, which are not letters or digits or plain ASCII, are now displayed using entities, or common escapes (e.g. "\r\n\t" or "Ӓ")
Document boost is shown, in addition to each field's boost.
The "Files" tab now shows which files found in the index directory really belong to the index, and what is their status.
A new function, "Index Cleanup", cleans up all files from index directory that do not belong to the index, and all files that are marked as deletable in the index. This action does NOT optimize the index.
Luke has been reviewed and restructured to provide better support for execution as a part of other applications (or JavaScript scripts). Javadoc comments were added.
The default binary bundle (lukeall.jar) now contains also a collection of analyzers from the "contrib" area, as well as the Snowball analyzers.
UI font selection: sometimes the field values contain characters not covered in the default 'dialog' font, like e.g. less common Unicode glyphs. Or maybe you just wish to view all UI text rendered in Garamond... now you can ;-)
UI color theme: you can now switch color themes for your eye's pleasure.

I would like to thank the following people for their comments, suggestions, bug-fixes and patches (in no particular order): Daniel Naber, Erik Hatcher, Grant Ingersoll, Ryan Cox, Terry Steichen, Lubos Pochman, Michael Franken, Luke Shannon, Todd VanderVeen, and others. Thank you!

Changes in v. 0.5:

This release introduces many changes and new, unique features:

NEW: Added support for Term Vectors.
NEW: Added a plugin framework - plugins found on classpath are detected automatically and added to the new "Plugins" tab. Note however that for now plugins do NOT work when using Java WebStart.
NEW: A sample plugin provided, based on Mark Harwood's "tool for analyzing analyzers".
NEW: all tables support resizable columns now. Some dialogs are also resizable.
NEW: Added Reconstruct functionality. Using this function users can reconstruct the content of all (also unstored) fields of a document. This function uses a brute-force approach, so it may be slow for larger indexes (> 500,000 docs).
NEW: Added "pseudo-edit" functionality. New document editor dialog allows to modify reconstructed documents, and add or replace the original ones.
FIX: problems with MRU list solved, and a framework for handling preferences introduced.
FIX: the list of available Analyzers is now dynamically populated from the classpath, using the same method as in the AnalyzerTool plugin.
FIX: restructured source repository and added Ant build script.

Please note that as a result of the package name changes, the main class is org.getopt.luke.Luke, and NOT as before luke.Luke.

I felt that all these changes merited a slight change in name, from "Lucene Index Browser" to "Lucene Index Toolbox", as this seems to better reflect the current functionality of the tool.

Changes in v. 0.45:

Added more details to the Overview panel.
Add support for undeleting all deleted documents.
Add Boost column to Document view.
Use nicer formatting for numbers in the Explain window.
Fix for not updating the parsed query view when pressing Search.
Fix the JNLP file to require J2SE 1.3+.
By popular demand, add a single self-contained JAR to the binary distribution.
Minor restructuring to increase reuse.

Changes in v. 0.4:

Use Lucene 1.3-FINAL. The WebStart version has been changed, so that it uses two separate JARs - one contains Luke, the other Lucene.
Added support for compound index format. It's also possible to change the format during optimization.
visualization of the query parsing. When you change the Analyzer or default field, or perform a Search, you can see the QueryParser's idea of what the final query looks like. Suggested by Erik Hatcher.
added functionality to view the explanation for a hit.
bugfix for broken behavior: when selecting "Show All Docs" on the "Documents" view, the program would use a QueryParser, whereas it should simply construct a primitive TermQuery. This bug could result in mysterious "No Results" on the search page. Spotted by Erik Hatcher.

I'll update the screenshots in a few days ...

Changes in v. 0.3:

Add several enhancements and bugfixes contributed by Ryan Cox:
- drop-down choice with most recently used indexes
- list of files in an index
- information about relative index changes after optimization
- timing of searches
- Bugfix: reload field list after opening another index
various small UI cleanups

Changes in v. 0.2:

Add Java WebStart version.
Add Read-Only mode.
Fix spinbox bug (really a bug in the Thinlet toolkit - fixed there).
Allow to browse hidden directories.
Add a combobox to choose the default field for searching.
Other minor code cleanups.

Screenshots

That's what tiggers love the most...

The following screenshot present the overview screen, just after you open an index.

The screenshot below shows you the document panel, where you can browse through documents sequentially, or select groups of documents by terms, which they contain.

The next screenshot shows you the Search panel, where you can enter search expressions in the standard Lucene QueryParser syntax. However, notice that you can select analyzer used to parse the query - either one of the predefined ones, or your own class in a classpath. You can also select the default field (this field is used when there is no specific field qualifier in your search expression).
You can also see in the "Parsed query view" area how the choice of analyzer affects the final query. In this case, please note how the phrase "more and more" has changed.

The screenshot below shows a dialog containing the explanation for a hit. The Explanation tree shows how various term matches and normalizations resulted in the final document score for the current query.
Please note how the fuzzy query expanded the term "book" into "books" (and, not visible here, "bookstore", "bookstores", etc...), adjusting the weight of this hit.

Last modified: Nov 14, 2008