GitHub - dwiddows/Hello-Wiki: This Java library opens a Wikipedia dump in form of an XML file and extracts the content of the individual articles. Content refers to the article text, a list of the links to other Wikipedia articles and more. It can also create a Lucene index, adding the article contents into different fields.

dwiddows / Hello-Wiki Public

forked from mwahle/Hello-Wiki

Notifications You must be signed in to change notification settings
Fork 0
Star 1

This Java library opens a Wikipedia dump in form of an XML file and extracts the content of the individual articles. Content refers to the article text, a list of the links to other Wikipedia articles and more. It can also create a Lucene index, adding the article contents into different fields.

GPL-3.0 license

1 star 2 forks Branches Tags Activity

Star

Notifications

Branches Tags

Name		Name	Last commit message	Last commit date
Latest commit History 11 Commits
.idea/libraries		.idea/libraries
src/mw		src/mw
.gitignore		.gitignore
LICENSE		LICENSE
README		README
build.xml		build.xml
dependency-reduced-pom.xml		dependency-reduced-pom.xml
pom.xml		pom.xml

Repository files navigation

This file is part of Hello-Wiki.

Hello-Wiki is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.

Hello-Wiki is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.

You should have received a copy of the GNU General Public License
along with Hello-Wiki. If not, see <http://www.gnu.org/licenses/>.

Experimenting with new random indexing methods and information retrieval
methods, the Wikipedia corpus turned out to be an intuitive base for
experiments. But I couldn't really find a tool that would extract the textual
information (stripped of all wiki markup) from the Wikipedia articles. So I
wrote a Java program, intended to be used as a library, to do this job.

Right now, there are two Java classes:
* Extractor.java, which does the actual job of loading and parsing the dump
file and extracting the articles
* MakeLuceneIndex.java, which uses the Extractor to walk through the complete
dump file and puts the contents into a Lucene index

The regular expressions that strip the wiki markup code are not exactly perfect
but were good enough for my purposes. I'm inviting anybody to use this project
and I'll happily include any improvements. Special thanks go to rvedam for
contributing the build.xml and getting me started on Java development.

Dependencies:

Currently, this project depends on the following libraries:
* Apache Lucene (currently version 6.6.0). You can download the Lucene library at
http://www.apache.org/dyn/closer.cgi/lucene/java/

Compilation Instructions (changed from MWahle's original instructions):

- Run "mvn clean install" in the root directory and inform dwiddows if there are problems.