Luke - Lucene Index Toolbox
Important note: Luke is now hosted at Google Code:
http://code.google.com/p/luke/ and you should go to that page to obtain the latest release of Luke.
This page contains only information about past releases and is no
longer up to date.
Lucene is an Open Source,
mature and high-performance Java search engine. It is highly flexible,
and scalable from hundreds to millions of documents.
Luke is a handy development and diagnostic tool, which accesses
already existing Lucene indexes and allows you to display and modify their
content in several ways:
- browse by document number, or by term
- view documents / copy to clipboard
- retrieve a ranked list of most frequent terms
- execute a search, and browse the results
- analyze search results
- selectively delete documents from the index
- reconstruct the original document fields, edit them and re-insert to the index
- optimize indexes
- and much more...
Recent versions of Luke are also extensible through plugins and scripting.
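As an illustration of the "ranked list of most frequent terms" idea listed above, here is a minimal sketch in plain Java. It is not Luke's code: the class name TopTermsSketch, the method topTerms and the sample counts are made up for this example, and the term counts are assumed to have been collected from the index already.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper (not part of Luke): ranks terms by their frequency. */
class TopTermsSketch {

    /** Returns up to n entries with the highest counts, highest first. */
    static List<Map.Entry<String, Integer>> topTerms(Map<String, Integer> counts, int n) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        // Sort by descending count; break ties alphabetically so the order is stable.
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        return new ArrayList<>(entries.subList(0, Math.min(n, entries.size())));
    }

    public static void main(String[] args) {
        // Made-up counts; a real tool would read them from the index.
        Map<String, Integer> counts = new LinkedHashMap<>();
        counts.put("index", 17);
        counts.put("lucene", 42);
        counts.put("luke", 17);
        counts.put("thinlet", 3);
        for (Map.Entry<String, Integer> e : topTerms(counts, 3)) {
            System.out.println(e.getValue() + "\t" + e.getKey());
        }
    }
}
```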
I started this project because I needed a tool like this. I decided to
distribute it under an Open Source license to express my gratitude to the Lucene
team for creating such a high-quality product. Lucene is one of the
landmark proofs that the Open Source paradigm can result in high-quality and
free products.
Download - source and binary
See above - this version information is outdated!!!
Current version is 0.9.9, released on September 30, 2009.
It uses the official Lucene 2.9.0 release JARs.
NOTE: Luke now requires Java 1.5 or higher.
You can download the binary JARs here:
- A standalone full JAR, containing Luke, Lucene, Rhino JavaScript,
plugins and additional analyzers (~7MB):
lukeall.jar
There are no external dependencies. This version can be run simply with java -jar lukeall.jar, or by double-clicking it in Windows.
- A standalone minimal JAR, containing Luke and Lucene (~850kB):
lukemin.jar
There are no external dependencies. This version can be run simply with java -jar lukemin.jar, or by double-clicking it in Windows.
- A separate JAR containing only Luke - NOTE that you need to supply the Lucene JARs on the classpath:
luke.jar (~120kB)
Again, remember to put at least the three required JARs on your classpath, e.g. (with the Windows ';' path separator; use ':' on Unix-like systems): java -classpath luke.jar;lucene.jar;lucene-misc.jar org.getopt.luke.Luke
You can download the source code ZIP (2MB): luke-src-0.9.9.zip
You can download the source code TGZ (2MB): luke-src-0.9.9.tgz
License
Luke is covered by the Apache
Software License, which means that it's free for any use, including
commercial use. It comes with full source code included (see section
above). Notice however that the Thinlet library is covered by
the GNU Library (Lesser) Public License, which puts different restrictions
on that portion of the program.
If you feel inclined, I would appreciate a short email note, in case
you find this program useful, or if you want to redistribute it in a
software collection. Although it's not required by the license, it
gives me some idea of how people use it, and what features are most
useful to them...
Bug reports
Hopefully, there will be none! :-) Ok, let's be realistic... if you
notice a bug, or if you come up with a useful feature request, or even
better - with patches that implement new functionality - please contact
the author (Andrzej Bialecki). Thank you in
advance for your comments and contributions!
Changes in v. 0.9.9 (released on 2009.09.30):
This release upgrades to Lucene 2.9.0 jars.
- New features and improvements:
- Luke can now open multiple indexes found in subdirectories one
level below the selected directory.
- Added Hadoop Plugin for working with indexes on Hadoop file systems.
This uses Hadoop 0.19.2 and dependencies released with this release of
Hadoop. The plugin uses partial local caching to speed up some operations.
- Term counts and percentages are calculated in a background thread,
so the opening of large indexes should be a little faster. Also, this
operation is skipped for indexes accessed over slow IO (such as HDFS).
- Added More Like This query builder from current document (or its
selected fields).
- Search time is now monitored using System.nanoTime, and the last search time
is preserved in the UI.
- Search can now be repeated N times to get a better estimate of average
search time. Note: the measured time covers only the search() call, not the retrieval
of documents.
- AnalyzerTool plugin now uses and illustrates the new token Attribute API.
- Nearly all uses of deprecated Lucene API are replaced with the new API.
- Bug fixes:
- Fix a counter-intuitive behavior where in the Open dialog Luke
chops off the last path element from previously used index path.
- Fix XMLExporter entity escaping, and add a missing quote in term vector size.
- Fix a long-standing Thinlet bug related to tabbedpane with many tabs -
now tabs don't "overflow" the tabbedpane area thus corrupting the display
of surrounding components.
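The search timing described in the 0.9.9 features above (System.nanoTime, N repetitions, only the search call measured) can be sketched roughly as follows. This is an illustration under stated assumptions, not Luke's actual code: the class SearchTimerSketch and the method averageNanos are hypothetical, and the search itself is replaced by a stand-in Runnable.

```java
/** Hypothetical helper (not Luke's code): averages the elapsed time of a repeated action. */
class SearchTimerSketch {

    /** Runs the action 'repeats' times and returns the mean elapsed time in nanoseconds. */
    static long averageNanos(Runnable search, int repeats) {
        if (repeats <= 0) {
            throw new IllegalArgumentException("repeats must be positive");
        }
        long total = 0L;
        for (int i = 0; i < repeats; i++) {
            long start = System.nanoTime();
            search.run();                       // only this call is inside the timed region
            total += System.nanoTime() - start;
        }
        return total / repeats;
    }

    public static void main(String[] args) {
        // Stand-in workload. A real caller would run the search here and
        // fetch the matching documents outside of the timed region.
        Runnable fakeSearch = () -> {
            long sum = 0;
            for (int i = 0; i < 1_000_000; i++) {
                sum += i;
            }
            if (sum < 0) {
                throw new IllegalStateException("unexpected overflow");
            }
        };
        long avg = averageNanos(fakeSearch, 10);
        System.out.printf("average search time: %.3f ms%n", avg / 1_000_000.0);
    }
}
```

Repeating the call smooths out one-off costs such as cold caches, which is presumably why an average over N runs gives a better estimate than a single measurement.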
I'd like to thank people who reported problems and suggested improvements:
Craig Stires, Phil Whelan, Andrea Habringer, Benjamin Beckmann, George Herson,
Daniel Noll, Mike McCandless, Chris Pimlott and others.
Changes in v. 0.9.2 (released on 2009.03.20):
This release upgrades to Lucene 2.4.1 jars.
- New features and improvements:
- Added term counts per field in Overview - contributed by Mark Harwood.
- Improved the Analysis plugin to show all token information, and highlight
whenever a token is selected from the list.
- Bug fixes:
Changes in v. 0.9.1 (released on 2008.11.22):
This is mostly a bug fix release of 0.9.
- New features and improvements:
- Added ability to set the maximum count of boolean clauses in BooleanQuery.
- Bug fixes:
- Unbalanced <commit> tags breaking the XML export. Reported by Teruhiko Kurosaka.
- Opening a non-existent index from command-line creates an empty directory.
Reported by Chris Pimlott. See also LUCENE-1464.
- IndexGate inadvertently deleting previous commit points, even if "Keep all commits"
option was specified. Reported by Mark Harwood.
- Empty index with no fields was reported as invalid. Discovered by Andrew Zhang and Michael
McCandless (LUCENE-1454).
Changes in v. 0.9 (released on 2008.11.15):
This release adds many functionality enhancements and advanced features available
in Lucene 2.4.
- New features and improvements:
- Added new tools:
- Check Index - checks Lucene indexes for problems, and can fix some of them.
This is a GUI front-end to the Lucene CheckIndex tool.
- Export to XML - exports index data and metadata to XML file. This is
available both from the GUI and from the command-line.
- Significantly improved Optimize and Cleanup tools.
- Added ability to set norms on any indexed field in a document, or a range of
documents.
- Delete multiple documents by specifying ranges of document numbers.
- Added support for new field functionality: omitTF and binary fields.
- Improve the low-level information about the index, including format version.
- Show interesting details about IndexCommit points and associated files.
- Add short explanations of index files' functions.
- Improve document reconstruction - now the information from TermFreqVector can
be used if available. Also, DocReconstructor can be used outside of Luke.
- Significantly improved advanced search options - QueryParser settings, Similarity
and HitCollector settings.
- Read-only functionality is supported directly in IndexReader.
- Bug fixes:
A lot of effort went into refactoring the code, moving away if possible from the spaghetti
code influenced by Thinlet and into a modular design. Still much needs to be done here. :(
This means that there are likely many more bugs than in the previous release, although I
tested all functionality to make sure that there is no data loss.
HOWEVER, if you work with precious data, it's always a good idea to use the
"Read-only" option.
Changes in v. 0.8.1 (released on 2008.02.13):
This release adds some functionality enhancements related to TermVectors and Payloads.
- New features and improvements:
- When editing document fields it's now possible to specify TermVectors with
offsets and/or positions.
- Added ability to show term vector positions and offsets, if available. It's also
possible to copy this list to the clipboard.
- Added ability to show term positions within a document, and display term payloads
if available, using one of several pre-defined payload decoders. It's also
possible to copy this list to the clipboard.
- It's now possible to view the full content of a stored field using various
content decoders (hex, date / time, number, utf8, arrays of int or float).
- Layout of "Browse by Term" panel is changed so that it better reflects the available
navigation.
- Bug fixes:
- Check added to prevent adding new documents if no index is open.
- Wrong class was used in IndexGate to represent deletable files, which caused a
ClassCastException.
- Some query types may have been skipped when displaying Explanation.
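To give an idea of what the content decoders mentioned in the 0.8.1 features do, here is a small self-contained sketch of three of them (hex, utf8, array of int). The class FieldDecoderSketch and its methods are hypothetical names for this example, not Luke's classes, and the big-endian layout of the int array is an assumption.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Hypothetical decoders (not Luke's classes): show raw field bytes in different views. */
class FieldDecoderSketch {

    /** Hex view: two lowercase hex digits per byte, separated by spaces. */
    static String toHex(byte[] data) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < data.length; i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", data[i] & 0xff));
        }
        return sb.toString();
    }

    /** UTF-8 view: malformed byte sequences are replaced rather than rejected. */
    static String toUtf8(byte[] data) {
        return new String(data, StandardCharsets.UTF_8);
    }

    /** Int-array view: assumes big-endian 32-bit values; trailing bytes are ignored. */
    static int[] toIntArray(byte[] data) {
        ByteBuffer buf = ByteBuffer.wrap(data);   // ByteBuffer is big-endian by default
        int[] out = new int[data.length / 4];
        for (int i = 0; i < out.length; i++) {
            out[i] = buf.getInt();
        }
        return out;
    }

    public static void main(String[] args) {
        byte[] raw = {0, 0, 0, 1, 0, 0, 1, 0};
        System.out.println(toHex(raw));                        // 00 00 00 01 00 00 01 00
        System.out.println(Arrays.toString(toIntArray(raw)));  // [1, 256]
        System.out.println(toUtf8("Luke".getBytes(StandardCharsets.UTF_8)));  // Luke
    }
}
```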
Changes in v. 0.8 (released on 2008.02.04):
This release upgrades to the official Lucene 2.3.0 release JARs.
NOTE: this version of Luke requires Java 1.5 or higher.
The following changes have been made in this release:
- New features and improvements:
- Added ability to show full text of a field in a popup dialog, both in
plain text and as a hexadecimal dump.
- It's also possible to save the content of a single field to an external file.
This is useful for saving binary fields, or examining exact byte content of a field.
- Added an option to load the index to RAMDirectory. NOTE: obviously you should take into
account the index size vs. the available heap size ... ;)
- GrowableStringArray is a separate public class now - perhaps some day I'll use it to implement
a bulk document reconstruct function.
- Luke now remembers the last Analyzer and last field used in the previous session.
- Bug fixes:
- Neither the document nor the field details have the "Boost" column anymore, because it's always 1.0f
in documents retrieved from an index. Instead this column now reads "Norms" and shows the fieldNorm
value of a field.
Changes in v. 0.7.1 (released on 2007.06.20):
This minor release is mostly an upgrade to the official Lucene 2.2.0 release JARs.
The following changes have been made in this release:
- New features and improvements:
- Added a term distribution analysis plugin by Mark Harwood.
- Bug fixes:
- Fixed IndexGate class to correctly show deletable files.
Changes in v. 0.7 (released on 2007.02.20):
This release uses the official Lucene 2.1.0 release JARs.
The following changes have been made in this release:
- New features and improvements:
- Added pagination of results, especially useful for very large
result sets.
- Added support for new Field flags. They are now
displayed in the Document details, and also can be set on
edited documents.
- Added a function to add new documents to the index.
- Low-level index operations (such as detecting unused files,
index directory cleanup) use the newly exposed Lucene classes
instead of duplicating their internals in Luke.
- A side-effect of the above is the ability to properly
cleanup all supported index formats, including the new lockless
and single-norm indexes.
- Added a function to copy the list of top terms to clipboard.
- Added a function to copy the term vector to clipboard.
- Added a function to close and/or re-open the current index.
- In the Documents tab, pressing "First Term" now positions the
term enumeration at the first term for the selected field.
- Added a field vocabulary analysis plugin by Mark Harwood, with
some modifications.
- Overall UI cleanup - improved layout in some places,
added graphics instead of ASCII art, etc.
- Bug fixes:
- Fixed a bug in index size calculation.
- Fixed a bug in term browsing - when "First Term" was pressed
in reality the second term was shown.
The following people contributed patches, suggestions, and
generally kept prodding me and poking to produce this release:
Volodymyr Bychkoviak,
Juan Manuel Caicedo,
Mark Harwood,
Otis Gospodnetic,
Benson Margulies,
Jean-Philippe Robichaud,
and many, many others. Thank you for your support!
Changes in v. 0.6:
The most important addition is the scripting framework based on Mozilla Rhino
JavaScript engine. Additional plugins and functions were added, as follows:
- The query view shows not only a parsed form,
but also a re-written query form.
- Query Structure shows internal structure of a query.
Explanations are provided both for the parsed and rewritten
queries.
- Command-line argument parsing. Now you can open an index on
startup, and optionally execute a script (see below).
- Custom Similarity designer plugin, which allows you to design
and test your own Similarity implementation.
- Scripting plugin, which allows you to interactively experiment
with Luke and Lucene indexes. This plugin also can run scripts
from Luke command-line.
- Ability to use MMapDirectory. Due to limitations in Lucene API
this feature relies on reflection API, and may sometimes fail if a
restrictive SecurityManager is in use. The Overview panel shows
which Directory implementation is used.
- Proper display of overlapping tokens (created with analyzers that
use setPositionIncrement(0) ).
- Field names are sorted in alphabetical order.
- Characters in field values that are not letters, digits or plain
ASCII are now displayed using entities or common escapes (e.g.
"\r\n\t" or "&#1234;").
- Document boost is shown, in addition to each field's boost.
- The "Files" tab now shows which files found in the index directory
really belong to the index, and what is their status.
- A new function, "Index Cleanup", cleans up all files from index
directory that do not belong to the index, and all files that are
marked as deletable in the index. This action does NOT optimize the index.
- Luke has been reviewed and restructured to provide better support
for execution as a part of other applications (or JavaScript
scripts). Javadoc comments were added.
- The default binary bundle (lukeall.jar) now contains also a
collection of analyzers from the "contrib" area, as well as the
Snowball analyzers.
- UI font selection: sometimes the field values contain characters
not covered in the default 'dialog' font, like e.g. less common
Unicode glyphs. Or maybe you just wish to view all UI text rendered
in Garamond... now you can ;-)
- UI color theme: you can now switch color themes for your eye's
pleasure.
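The display of unusual characters in field values, mentioned in the list above, can be pictured with the following sketch. It is only one possible reading of that item, not Luke's implementation: the class FieldValueEscaperSketch is hypothetical, and the exact rule (keep letters, digits and printable ASCII; show CR, LF and TAB as backslash escapes; show everything else as a decimal numeric entity) is an assumption.

```java
/** Hypothetical escaper (not Luke's code): makes unusual characters visible in field values. */
class FieldValueEscaperSketch {

    static String escape(String value) {
        StringBuilder sb = new StringBuilder();
        // Iterate by code point so that characters outside the BMP stay intact.
        for (int i = 0; i < value.length(); ) {
            int cp = value.codePointAt(i);
            i += Character.charCount(cp);
            if (cp == '\r') {
                sb.append("\\r");
            } else if (cp == '\n') {
                sb.append("\\n");
            } else if (cp == '\t') {
                sb.append("\\t");
            } else if (Character.isLetterOrDigit(cp) || (cp >= 0x20 && cp < 0x7f)) {
                sb.appendCodePoint(cp);          // letters, digits, printable ASCII: unchanged
            } else {
                sb.append("&#").append(cp).append(';');   // decimal numeric entity
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // U+2603 (snowman) is neither a letter, a digit, nor ASCII.
        System.out.println(escape("a\tb\u2603"));   // prints a\tb&#9731;
    }
}
```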
I would like to thank the following people for their comments, suggestions,
bug-fixes and patches (in no particular order): Daniel Naber, Erik Hatcher,
Grant Ingersoll, Ryan Cox, Terry Steichen, Lubos Pochman, Michael Franken,
Luke Shannon, Todd VanderVeen, and others. Thank you!
Changes in v. 0.5:
This release introduces many changes and new, unique features:
- NEW: Added support for Term Vectors.
- NEW: Added a plugin framework - plugins found on classpath are
detected automatically and added to the new "Plugins" tab.
Note however that for now plugins do NOT work when using Java WebStart.
- NEW: A sample plugin provided, based on Mark Harwood's
"tool for analyzing analyzers".
- NEW: all tables support resizable columns now. Some dialogs are
also resizable.
- NEW: Added Reconstruct functionality. Using this function users
can reconstruct the content of all (also unstored) fields of a
document. This function uses a brute-force approach, so it may be
slow for larger indexes (> 500,000 docs).
- NEW: Added "pseudo-edit" functionality. The new document editor
dialog allows you to modify reconstructed documents, and add or replace
the original ones.
- FIX: problems with MRU list solved, and a framework for handling
preferences introduced.
- FIX: the list of available Analyzers is now dynamically
populated from the classpath, using the same method as in the
AnalyzerTool plugin.
- FIX: restructured source repository and added Ant build script.
Please note that as a result of the package name changes, the main class is
org.getopt.luke.Luke, and NOT luke.Luke as before.
I felt that all these changes merited a slight change in name, from "Lucene
Index Browser" to "Lucene Index Toolbox", as this seems to better reflect the
current functionality of the tool.
Changes in v. 0.45:
- Added more details to the Overview panel.
- Add support for undeleting all deleted documents.
- Add Boost column to Document view.
- Use nicer formatting for numbers in the Explain window.
- Fix for not updating the parsed query view when pressing Search.
- Fix the JNLP file to require J2SE 1.3+.
- By popular demand, add a single self-contained JAR to the binary distribution.
- Minor restructuring to increase reuse.
Changes in v. 0.4:
- Use Lucene 1.3-FINAL. The WebStart version has been changed, so
that it uses two separate JARs - one contains Luke, the other Lucene.
- Added support for compound index format. It's also possible to
change the format during optimization.
- visualization of the query parsing. When you change the Analyzer
or default field, or perform a Search, you can see the QueryParser's
idea of what the final query looks like. Suggested by Erik Hatcher.
- added functionality to view the explanation for a hit.
- bugfix for broken behavior: when selecting "Show All Docs" on the
"Documents" view, the program would use a QueryParser, whereas it should
simply construct a primitive TermQuery. This bug could result in
mysterious "No Results" on the search page. Spotted by Erik Hatcher.
I'll update the screenshots in a few days ...
Changes in v. 0.3:
- Add several enhancements and bugfixes contributed by Ryan Cox:
- drop-down choice with most recently used indexes
- list of files in an index
- information about relative index changes after optimization
- timing of searches
- Bugfix: reload field list after opening another index
- various small UI cleanups
Changes in v. 0.2:
- Add Java WebStart version.
- Add Read-Only mode.
- Fix spinbox bug (really a bug in the Thinlet toolkit - fixed there).
- Allow browsing hidden directories.
- Add a combobox to choose the default field for searching.
- Other minor code cleanups.
Screenshots
That's what tiggers love the most...
The following screenshot presents the overview screen, just after you
open an index.
The screenshot below shows you the document panel, where you can browse
through documents sequentially, or select groups of documents by the terms
they contain.
The next screenshot shows you the Search panel, where you can enter
search expressions in the standard Lucene QueryParser syntax. However,
notice that you can select the analyzer used to parse the query - either one
of the predefined ones, or your own class on the classpath. You can also
select the default field (this field is used when there is no specific
field qualifier in your search expression).
You can also see in the "Parsed query view" area how the choice of analyzer affects the final query. In this case, please note how the phrase "more and more" has changed.
The screenshot below shows a dialog containing the explanation for a hit.
The Explanation tree shows how various term matches and normalizations
resulted in the final document score for the current query.
Please note how the fuzzy query expanded the term "book" into "books" (and, not visible here, "bookstore", "bookstores", etc...), adjusting the weight of this hit.
Last modified: Nov 14, 2008