forked from performant-software/juxta-service
-
Notifications
You must be signed in to change notification settings - Fork 0
/
README
117 lines (84 loc) · 5.12 KB
/
README
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
===============================================================================
Juxta Web Service Service (Juxta WS)
===============================================================================
Synopsis
--------
The Juxta family of software (Juxta, Juxta WS, and Juxta Commons) allows you to
compare and collate versions of the same textual work. The Juxta Web Service (Juxta WS)
is an open source Java application that provides the core collation and
visualization functions of Juxta in a server environment via an API. Development
of Juxta WS was supported by NINES at the University of Virginia.
Features
--------
Juxta WS can collate two or more versions of the same textual work (“witnesses”)
and generate a list of alignments as well as two different styles of visualization
suitable for display on the web. The “heat map” visualization shows a base text
with ranges of difference from the other witnesses highlighted. The “side by side”
visualization shows two of the witnesses in synchronously scrolling columns, with
areas of difference highlighted.
System requirements
-------------------
Juxta WS runs on Java 1.6 or greater. By default it expects to have 1 GB memory availble.
More memory is preferable, since the collation process is memory intensive.
Juxta WS stores source, witness and collation data in a MySQL database. It is
compatible with version greater than 5.1.
Collation Pipeline Overview
----------------------------
The Juxta WS collation process is based on the Gothenburg Model, which breaks
collation in this pipeline of steps:
1 - Raw Content is added to the system and becomes a "source" (XML and TXT are
the accepted formats).
2 - The source is transformed into a "witness," with XML sources being transformed
by XSLT and TXT sources passed through with limited metadata added.
3 - Witness are passed through a tokenizer that breaks text content into tokens
based on whitespace and/or punctuation settings. Those tokens are stored in
the database as offset / length pairs.
4 - Streams of witness tokens are then fed into the diff component, which finds
areas of change. This diff is done with java-diff-utils from Google, and is an
implementation of the Myers diff algorithm.
5 - These differences are aligned in a process by which gaps are inserted into the
token stream such that the tokens of all witnesses match up. These gaps fill in
holes in the token stream where one witness has missing content relative to
another (for more information and examples, see the TEI wiki on Textual Variance).
Once they are aligned, tokens that have been found to have differences are paired,
and this information is stored in the database.
6 - Visualizations leverage the alignment information to inject markup into
witness texts. When passed up to a web browser, this marked-up text is rendered
according to the selected visualization.
Package Components
------------------
juxta-diff - This is ithe component that handless all of the text diff logic.
juxta-indexer - This is a helper component that can be used to create a lucene index
of the sources and witnesses in you main database.
juxta-ws - The main application. It exposes the RESTful API and manages the collation
pipeline.
Installation
------------
First, the MySQL database instance that JuxtaWS uses must be set up. Do this
by issuing the following command from the root directory of the Juxta Service checkout:
./juxta-ws/scripts/init_db.sh [db_user] [db_pass]
If you have setup your MySQL credentials in a .my.cnf file, or your MySQL instance has
no user / password protection, you can leave off both db_* parameters. Similarly,
if you have a user but no password, leave off the db_pass parameter.
Once the database is setup, you can build the service. From the root directory of
the Juxta Service checkout, issue the command:
mvn package
Once this command completes, the web service *.tar.gz can be found in juxta-ws/target/
Extract this to the location of your choice and change to the newly created directory.
The service configuration can be found in config/ws.properties. It can be left as-is
or tuned to fit your needs. Comments in the file provide details on what each setting
is used for.
Launch the service with the startup.sh found in the root directory.
At this point you will have a running service, but searching will not work. To get
that setup, kill the juxta process and do the following:
java -jar juxta-indexer.jar
Extract the binary juxta-indexer *.tar.gz file found in juxta-indexer/target to the
location of you choice. Move to the new directory and issue the command:
java -jar juxta-indexer.jar [db user] [db pass]
Once cmplete, you will find a lucene-index directory has been created. Either move this
directory into the web service directory or create a link to it there. Modify the
web service configuration to point to ths directory. The config property to update is:
juxta.lucene.indexDir
Once that has been set, restart the web service with start.sh
Details of the RESTful API can be found here:
https://github.com/performant-software/juxta-service/wiki/API-Documentation