Zawilinski
is a Java library designed specifically to simplify the extraction of
data from Wiktionary entries; more generally, it simplifies the process of loading
and filtering MediaWiki XML dumps so that the resulting document object tree is of
manageable size (i.e., small enough so that generating and accessing the object tree
doesn't cause a typical desktop computer to "thrash").
Interface | Description |
---|---|
InflectionTemplateLoader.TemplateCallback | |
PostFilter | Classes that implement this interface determine which Pages and Revisions are removed by the post-filter. |
Class | Description |
---|---|
FilterWiktionaryByLanguage | Program to filter the data for a given language from Wiktionary. |
FilterWiktionaryByLanguage.MyOptions | Container for command-line options. |
FilterWiktionaryByTitle | Filter a MediaWiki dump for a list of specific articles. |
InflectionTemplateLoader | Load the raw inflection template strings from a filtered MediaWiki dump. |
InflectionTemplateLoader.MyOptions | Command-line parameters for this class. |
LanguagePrefilter | A pre-filter that only allows through text that is in the section for the specified language. |
MediaWikiLoader | Generate the DOM for a MediaWiki XML document. |
PageFilterListener | Handles events from the unmarshaller. |
PostFilterByLanguage | Keep only those pages that contain some data for the specified language, and remove the trailing level 2 language header left by the pre-filter. |
PostFilteredMediaWikiLoader | Generate a DOM for a Wiktionary XML document that contains only elements for selected entries. |
PreFilteredMediaWikiLoader | Generate the DOM for a MediaWiki XML document from a filtered XML stream. |
TextPrefilter | Abstract class that simplifies the writing of SAX filters that examine and modify only the contents of MediaWiki <text> tags. |
TextSizePrefilter | A TextPrefilter that limits the number of characters allowed in a given MediaWiki <text> element. |
Util | Utilities to simplify the reading and writing of XML documents as well as navigating MediaWikiType objects. |
WiktionaryWriter | Writes an XML object tree as an XML file. |
Zawilinski | Logging levels and Zawilinski.main(String[]). |
Exception | Description |
---|---|
MediaWikiLoader.XMLConfigurationException | Exception describing a problem with the XML software setup. |
Wiktionary and Wikipedia make their data available for study by publishing very large (tens of gigabytes) XML files. There are two basic approaches to loading XML files: stream parsers and tree-based parsers. Stream parsers (e.g., the Simple API for XML, or SAX) are efficient; however, writing a filter for a stream parser is time-consuming and difficult because the programmer must explicitly keep track of the current position in the XML parse tree and often must write code to buffer previous XML events.
Tree-based parsers (e.g., the Document Object Model, or DOM) load the entire XML document into an object tree. Writing code to access data in the object tree is much easier than handling the XML events generated by a stream parser; however, a document's object tree is typically 2 to 10 times the size of the document itself. Thus, it is impractical, if not impossible, to load an entire Wikipedia or Wiktionary XML file into an object tree. In fact, as a result of vandalism, there are a few Wiktionary articles that are themselves too large to be loaded into an object tree.

The Zawilinski library combines the efficiency of a stream-based (i.e., SAX) parser with the simplicity of a tree-based (i.e., DOM) parser. It processes documents first with a SAX parser, then with a DOM parser. Users can place filters on both parsers. The two sets of filters then work together to remove unnecessary data as soon as practical (by which we mean "easy to code"), thereby reducing both the workload of the DOM parser and the memory footprint of the resulting object tree. (Of course, this is only helpful when the desired analysis requires a sufficiently small subset of the XML data.)
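The contrast between the two approaches can be seen with the JDK's own parsers. The following sketch is illustrative only (the tiny inline document and class names are ours, not Zawilinski's): the SAX handler merely receives events and must track context itself, while the DOM parser hands back a fully navigable tree.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.SAXParserFactory;
import org.w3c.dom.Document;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class ParserComparison {
    // A toy stand-in for a MediaWiki dump.
    static final String XML =
        "<mediawiki><page><title>pies</title>"
        + "<revision><text>==Polish==\ndog</text></revision></page></mediawiki>";

    // SAX: a stream of events; no tree is built, so memory use stays flat,
    // but the handler itself must remember where in the document it is.
    public static int countStartElements() throws Exception {
        final int[] count = {0};
        SAXParserFactory.newInstance().newSAXParser().parse(
            new InputSource(new StringReader(XML)),
            new DefaultHandler() {
                @Override
                public void startElement(String uri, String local,
                                         String qName, Attributes atts) {
                    count[0]++;
                }
            });
        return count[0];
    }

    // DOM: the whole document becomes an object tree, which is easy to
    // query but can be many times the size of the document itself.
    public static String titleViaDom() throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder()
            .parse(new InputSource(new StringReader(XML)));
        return doc.getElementsByTagName("title").item(0).getTextContent();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(countStartElements()); // 5 elements seen as events
        System.out.println(titleViaDom());        // "pies" read from the tree
    }
}
```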
Zawilinski is based on three key observations:

1. Most of the data in a dump is unrelated to any one research question.
2. Removing an entire article with a SAX filter is difficult, because doing so requires buffering the entire <article> element as well as all events that occur between the start of the article and the point at which the filter determines whether to keep the article.
3. Most of an article's bulk lies in its content (i.e., the <text> element).

Zawilinski therefore first applies filters (called "pre-filters") to the SAX parser to remove most of the article content unrelated to the research question. It then applies filters (called "post-filters") to the DOM parser to remove the entire article from the object tree. The pre-filters remove the majority of the unnecessary data, thereby significantly reducing the DOM parser's workload and memory requirements. The post-filters then remove the rest of the article. Consequently, the DOM parser avoids building large object sub-trees that will soon be discarded, and the programmer avoids having to explicitly buffer XML events to remove articles at the SAX-processing stage.
Here is how we use Zawilinski to support our study of Polish inflection data in Wiktionary:
The English Wiktionary contains English-language descriptions of words in many different languages. Each article may contain sections for several different languages. For example, the article for "pies" contains an English section discussing the plural of the word "pie" and a Polish section discussing the Polish word "pies", which means "dog". For our study of Polish inflection data in Wiktionary, we are interested in only the Polish section of articles. We are not interested in any articles that do not have a Polish section.
To filter a Wiktionary dump for the data relevant to our study, we defined two pre-filters and one post-filter:

1. A LanguagePrefilter that discards article text until it encounters the Polish language header ("==Polish=="). It then retains all article data until it encounters the header for the next language section (e.g., "==Spanish==").
2. A TextSizePrefilter that limits the number of characters allowed in a given <text> element (guarding against the few vandalized articles that are too large to load).
3. A PostFilterByLanguage that keeps only those pages containing some data for Polish and removes the trailing level 2 language header left by the pre-filter.
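The behavior of the language pre-filter can be sketched as a plain string operation. The helper below is hypothetical (it is not part of Zawilinski, which does this incrementally on SAX character events rather than on whole strings), but it shows the intended effect on an article's wikitext:

```java
public class LanguageSectionSketch {
    // Hypothetical helper: keep only the section that starts at the given
    // level-2 language header and stops at the next level-2 header.
    public static String keepSection(String text, String language) {
        String header = "==" + language + "==";
        int start = text.indexOf(header);
        if (start < 0) return "";                 // article has no such section
        int next = text.indexOf("\n==", start + header.length());
        return next < 0 ? text.substring(start)   // section runs to end of text
                        : text.substring(start, next + 1);
    }

    public static void main(String[] args) {
        // The "pies" article: an English section and a Polish section.
        String article = "==English==\nplural of pie\n==Polish==\ndog\n==Spanish==\nfeet";
        System.out.print(keepSection(article, "Polish")); // ==Polish==\ndog\n
    }
}
```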
Zawilinski comes with several stand-alone filter programs; alternatively, users can use it as a library to support their own custom filters.
FilterWiktionaryByLanguage: Filter a Wiktionary XML dump to retain only those pages or revisions that contain information for a specified language. Run java -jar Zawilinski.jar --help for details.

Users can also write their own pre-filters. A pre-filter is an XMLFilter and can modify the XML stream any way it likes (it is not limited to filtering text); however, as explained earlier, writing a filter for a SAX parser can be very difficult.
We provide the abstract class TextPrefilter
to simplify the process
of filtering article text. A subclass of TextPrefilter
must define
TextPrefilter.handleTextElementCharacters(char[], int, int)
.
This method examines the characters in the array passed to it, then calls
TextPrefilter.sendCharacters(char[], int, int)
with
those characters that are to pass through the filter. (See TextPrefilter
for details.) SAX may pass the article text to the filter over the course of several events. Thus,
each call to handleTextElementCharacters
may contain only part of the article text.
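A minimal subclass might look like the sketch below. Because this sketch must run on its own, the stub base class stands in for Zawilinski's TextPrefilter (the real class plumbs these methods into a SAX filter chain, and the `filter` harness method is ours, not part of the library's API); the example subclass truncates text after a fixed number of characters, in the spirit of TextSizePrefilter:

```java
// Stand-in for Zawilinski's TextPrefilter so this sketch is self-contained.
abstract class TextPrefilterStub {
    private final StringBuilder passed = new StringBuilder();

    // Subclasses examine incoming characters and forward the ones to keep.
    protected abstract void handleTextElementCharacters(char[] ch, int start, int length);

    // In the real class this forwards a SAX characters event downstream.
    protected final void sendCharacters(char[] ch, int start, int length) {
        passed.append(ch, start, length);
    }

    // Test harness (not in the real API): feed text through the filter.
    public String filter(String text) {
        char[] ch = text.toCharArray();
        handleTextElementCharacters(ch, 0, ch.length);
        return passed.toString();
    }
}

public class SizeLimitSketch extends TextPrefilterStub {
    private int remaining = 10;  // keep at most 10 characters of article text

    @Override
    protected void handleTextElementCharacters(char[] ch, int start, int length) {
        // SAX may deliver the text in several chunks, so track what remains.
        int keep = Math.min(length, remaining);
        if (keep > 0) {
            sendCharacters(ch, start, keep);
            remaining -= keep;
        }
    }

    public static void main(String[] args) {
        System.out.println(new SizeLimitSketch().filter("0123456789ABCDEF"));
    }
}
```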
It is not always necessary to write a pre-filter. If the articles of interest are small enough and few
enough, then using the TextSizePrefilter
as the only pre-filter will be
sufficient.
A post-filter is a class that implements the PostFilter interface. This interface defines two methods: keepPage and keepRevision. Writing these methods requires the programmer to know the MediaWiki XML Schema and the Java classes that correspond to each element in the schema. (See the edu.gvsu.kurmasz.zawilinski.mw.current package for details.)
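The shape of a post-filter can be sketched as follows. The stub interface and the PageStub/RevisionStub classes below are simplified stand-ins so the sketch compiles on its own; in Zawilinski the real PostFilter interface operates on the schema-generated classes in edu.gvsu.kurmasz.zawilinski.mw.current, whose exact signatures are not reproduced here:

```java
// Simplified stand-ins for the schema-generated MediaWiki classes.
class PageStub {
    final String title;
    PageStub(String title) { this.title = title; }
}
class RevisionStub {
    final String text;
    RevisionStub(String text) { this.text = text; }
}

// Stand-in for Zawilinski's PostFilter interface: decide, element by
// element, what survives into the final object tree.
interface PostFilterStub {
    boolean keepPage(PageStub page);
    boolean keepRevision(RevisionStub revision);
}

// Example: keep only revisions whose text still contains a Polish section.
public class KeepPolishSketch implements PostFilterStub {
    public boolean keepPage(PageStub page) {
        return !page.title.isEmpty();   // e.g., discard pages with no title
    }
    public boolean keepRevision(RevisionStub revision) {
        return revision.text.contains("==Polish==");
    }

    public static void main(String[] args) {
        KeepPolishSketch f = new KeepPolishSketch();
        System.out.println(f.keepRevision(new RevisionStub("==Polish==\ndog")));
        System.out.println(f.keepRevision(new RevisionStub("==English==\npie")));
    }
}
```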
The English Wiktionary dumps can be found at http://dumps.wikimedia.org/enwiktionary/. Clicking on a date takes you to a web page that describes the different files available. Clicking on latest takes you to a list of files. We have tended to use one of two files:
pages-meta-history.xml.(7z|bz2): All pages with complete edit history.
pages-articles.xml.bz2: Current versions of article content.
You can download these files by simply clicking on the link on the web page, but we have found it more convenient to use curl:
curl http://dumps.wikimedia.org/enwiktionary/DATE/enwiktionary-DATE-pages-meta-history.xml.bz2 > LOCAL_NAME.xml.bz2
Where DATE is the date of the dump in YYYYMMDD
format, and LOCAL_NAME is the name
you want the file to have locally.