This is an FCS Endpoint implementation for the (No)SketchEngine. It uses the bonito-open API as its search backend.
It is being developed by the Leipzig Corpora Collection (LCC) and the Saxon Academy of Sciences and Humanities in Leipzig (SAW), and the code is licensed under the MIT License.
This repository should only be regarded as a basis for your own deployments. Templates and example configurations contain LCC-specific URLs; use those only for testing and for trying out this code base. If you want to deploy your own FCS endpoint, please check that you have permission to use the specific NoSketchEngine API. You can easily set up your own NoSketchEngine with e.g. ELTE-DH/NoSketch-Engine-Docker.
There is a partial (No)SketchEngine API adapter in d.s.t.w.f.f.noske that can be extracted and used as is. Besides its use in this endpoint, there is a test case that shows how to use it.
Note that there are some basic assumptions about the backend NoSketchEngine searcher.
Those are implementation details and can be seen in the classes d.s.t.w.f.f.NoSkESRUFCSEndpointSearchEngine and d.s.t.w.f.f.query.FCSQLtoNoSkECQLConverter.
- We assume that all corpora are freely accessible and that there are no sub-corpora. The endpoint will dynamically configure itself by listing all available corpora and setting the appropriate metadata.
- The corpus language_id is an ISO 639-3 identifier, e.g. deu.
- We only have a single (required) structure: s, meaning sentence (with optional attributes id/source/date that are not really used at this point).
- We use the following attributes: word (required), lemma, pos (with pos_ud17) and lc (required) / lemma_lc as automatically lower-cased variants of word/lemma. lemma and pos are optional attributes.
  - The attributes pos and pos_ud17 are not completely integrated. At the moment, only the pos attribute is checked, and it might not be UD17 (as required by FCS).
Adapting the code to your own corpus configuration should not be too complicated.
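To see whether a corpus fits the assumptions listed above before pointing the endpoint at it, a small check along the following lines may help. This is only a sketch: the class CorpusAssumptionCheck and its method are hypothetical and not part of this repository, and the language check only tests the shape of the identifier (three lowercase letters), not whether it is a registered ISO 639-3 code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical helper (not part of this repository): checks corpus metadata
// against the assumptions listed above.
final class CorpusAssumptionCheck {
    // Attributes and structures marked as required in the list above.
    static final List<String> REQUIRED_ATTRIBUTES = List.of("word", "lc");
    static final List<String> REQUIRED_STRUCTURES = List.of("s");

    /** Returns a list of violated assumptions; empty if the corpus looks fine. */
    static List<String> findProblems(String languageId, Set<String> attributes, Set<String> structures) {
        List<String> problems = new ArrayList<>();
        // Shape check only: ISO 639-3 identifiers are three lowercase letters, e.g. "deu".
        if (languageId == null || !languageId.matches("[a-z]{3}")) {
            problems.add("language_id does not look like an ISO 639-3 code: " + languageId);
        }
        for (String attribute : REQUIRED_ATTRIBUTES) {
            if (!attributes.contains(attribute)) {
                problems.add("missing required attribute: " + attribute);
            }
        }
        for (String structure : REQUIRED_STRUCTURES) {
            if (!structures.contains(structure)) {
                problems.add("missing required structure: " + structure);
            }
        }
        return problems;
    }

    public static void main(String[] args) {
        // A corpus that satisfies the assumptions: prints []
        System.out.println(findProblems("deu", Set.of("word", "lc", "lemma", "pos"), Set.of("s")));
        // A corpus with a two-letter language code, no lc attribute and no s structure
        System.out.println(findProblems("de", Set.of("word"), Set.of()));
    }
}
```

The attribute and structure names would have to come from your own corpus configuration; how you obtain them is up to your setup.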
- Dockerfile: Multi-stage Maven build and slim Jetty runtime image.
- docker-compose.yml
- pom.xml: Java dependencies for use with Maven.
- .env.template: Template .env file for Docker deployments.
The following classes live in the de.saw_leipzig.textplus.webservices.fcs.fcs_noske_endpoint namespace.
- d.s.t.w.f.f.NoSkESRUFCSConstants: Constants for accessing FCS request parameters and output generation. Can be used to store your own constants.
- d.s.t.w.f.f.NoSkESRUFCSEndpointSearchEngine: The glue between the FCS and our own search engine. It is the actual implementation that handles SRU/FCS explain and search requests. Here, we load and initialize our FCS endpoint. It will perform searches with our own search engine (here only with static results), and wrap results into the appropriate output (d.s.t.w.f.f.NoSkESRUFCSSearchResultSet).
- d.s.t.w.f.f.NoSkESRUFCSSearchResultSet: FCS Data View output generation. Generates the basic HITS and ADVANCED Data Views. Here, custom output can be generated from the result wrapper d.s.t.w.f.f.searcher.MyResults.
  - d.s.t.w.f.f.searcher.MyResults: Lightweight wrapper around our own results that gives access to result counts and to result items by index, and wraps the native result entries with kwic, left and right context as well as some metadata.
- d.s.t.w.f.f.query.CQLtoNoSkECQLConverter: Query conversion from simple CQL to a (No)SketchEngine CQL (CQP) query.
- d.s.t.w.f.f.query.FCSQLtoNoSkECQLConverter: Query conversion from FCS-QL to a (No)SketchEngine CQL (CQP) query.
- d.s.t.w.f.f.noske.NoSkeAPI: NoSkE Bonito API client.
- Namespace d.s.t.w.f.f.noske.pojo: NoSkE Bonito API response wrapper classes.
- d.s.t.w.f.f.util.LanguagesISO693: Helper class (from the FCS SRU Aggregator) that handles conversion between ISO 639 codes and language names.
- src/main/resources/lang/iso-639-3_20230123.tab: Resource file for the ISO 639 conversion.
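To give a rough idea of what the query converters have to do, the following sketch turns a plain search term or phrase into a corpus CQL query. It is not the code of CQLtoNoSkECQLConverter: the class SimpleTermToCqlSketch is hypothetical, and matching each token against the lc attribute is an assumption based on the attribute list given earlier in this README.

```java
import java.util.Locale;
import java.util.StringJoiner;

// Hypothetical sketch (not the repository's converter): maps a simple term or
// phrase to a (No)SketchEngine CQL query, one token pattern per word.
final class SimpleTermToCqlSketch {
    // Characters that get a backslash because attribute values are regular
    // expressions inside double quotes.
    private static final String SPECIAL = "\\\"[](){}.*+?|^$";

    static String escape(String token) {
        StringBuilder escaped = new StringBuilder();
        for (char c : token.toCharArray()) {
            if (SPECIAL.indexOf(c) >= 0) {
                escaped.append('\\');
            }
            escaped.append(c);
        }
        return escaped.toString();
    }

    static String convert(String phrase) {
        if (phrase == null || phrase.isBlank()) {
            throw new IllegalArgumentException("empty query");
        }
        StringJoiner cql = new StringJoiner(" ");
        for (String token : phrase.trim().split("\\s+")) {
            // Assumption: search the lower-cased variant of word (attribute lc).
            cql.add("[lc=\"" + escape(token.toLowerCase(Locale.ROOT)) + "\"]");
        }
        return cql.toString();
    }

    public static void main(String[] args) {
        System.out.println(convert("something"));   // [lc="something"]
        System.out.println(convert("Hello World")); // [lc="hello"] [lc="world"]
    }
}
```

The real converters work on parsed CQL and FCS-QL query trees and cover more than this; see the classes named above for the actual behavior.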
Only log4j2.xml is important here, in case you want to change the logging settings.
- endpoint-description.xml: FCS Endpoint Description, like resources, capabilities etc. This file can be used to pre-configure the endpoint, e.g., to restrict the exposed resources. Alternatively, if the FCS_RESOURCES_FROM_NOSKE parameter is set, resource information is queried from the (No)SketchEngine API and all resources found are exposed; the Endpoint Description is then generated programmatically.
- jetty-env.xml: Jetty environment variable settings.
- sru-server-config.xml: SRU endpoint settings.
- web.xml: Java Servlet configuration, SRU/FCS endpoint settings.
The configuration parameters for the endpoint (set via the Java environment variable context) are:
- NOSKE_API_URI: URI; base URI of the (No)SketchEngine Bonito endpoint, required!
- FCS_RESOURCES_FROM_NOSKE: Boolean; whether the (No)SketchEngine /corpora API endpoint should be used to automatically generate the Endpoint Description with the list of resources (corpora). If false, the Endpoint Description file that is embedded or specified with RESOURCE_INVENTORY_URL ("de.saw_leipzig.textplus.webservices.fcs.fcs_noske_endpoint.resourceInventoryURL") is used.
- DEFAULT_RESOURCE_PID: String; default resource PID for searches where no x-fcs-context is specified. Take care to include the resource PID prefix, if any, as specified in d.s.t.w.f.f.NoSkESRUFCSConstants.
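How these three parameters relate can be sketched in a few lines of Java. The class EndpointConfigSketch is hypothetical and does not show how the endpoint itself reads its settings (that happens through jetty-env.xml and the servlet context); treating a missing FCS_RESOURCES_FROM_NOSKE as false is an assumption, and the URL in main is only a placeholder.

```java
import java.util.Map;

// Hypothetical sketch (not the repository's code): resolves the three
// documented parameters from a key/value map such as System.getenv().
final class EndpointConfigSketch {
    final String noskeApiUri;
    final boolean resourcesFromNoske;
    final String defaultResourcePid; // null if not configured

    EndpointConfigSketch(Map<String, String> env) {
        String uri = env.get("NOSKE_API_URI");
        if (uri == null || uri.isBlank()) {
            // NOSKE_API_URI is documented as required.
            throw new IllegalArgumentException("NOSKE_API_URI is required");
        }
        this.noskeApiUri = uri;
        // Assumption: an unset flag means "use the Endpoint Description file".
        this.resourcesFromNoske = Boolean.parseBoolean(env.getOrDefault("FCS_RESOURCES_FROM_NOSKE", "false"));
        this.defaultResourcePid = env.get("DEFAULT_RESOURCE_PID");
    }

    public static void main(String[] args) {
        // Placeholder values for illustration only.
        EndpointConfigSketch config = new EndpointConfigSketch(Map.of(
                "NOSKE_API_URI", "http://localhost/bonito/run.cgi",
                "FCS_RESOURCES_FROM_NOSKE", "true"));
        System.out.println(config.noskeApiUri);
        System.out.println(config.resourcesFromNoske);
        System.out.println(config.defaultResourcePid);
    }
}
```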
Build the fcs.war file for webapp deployment:
mvn [clean] package
Some endpoint/resource configurations are set using environment variables. See jetty-env.xml for details. You can set default values there.
For production use, you can set values in the .env file, which is then loaded by the docker-compose.yml configuration. Take a look at the .env.template file and save a copy as .env with your own configuration.
This SRU/FCS Endpoint project includes both a Dockerfile and a docker-compose.yml configuration.
The Dockerfile can be used to build a simple Jetty image to run the FCS endpoint. It still needs to be configured with port mappings, environment variables etc. The docker-compose.yml file bundles all those runtime configurations to allow easier deployment. You still need to create a .env file or set the environment variables if you use the generated code as is.
# build the image and label it "fcs-endpoint"
docker build -t fcs-endpoint .
# run the image in the foreground (to see logs and interact with it) with environment variables from .env file
docker run --rm -it --name fcs-endpoint -p 8200:8080 --env-file .env fcs-endpoint
# or run in background with automatic restart
docker run -d --restart=unless-stopped --name fcs-endpoint -p 8200:8080 --env-file .env fcs-endpoint
# build
docker-compose build
# run
docker-compose up [-d]
Uses Jetty 10. See the jetty-maven-plugin plugin in pom.xml.
mvn [package] jetty:run-war
NOTE: jetty:run-war uses the built war file in the target/ folder.
A search request for the term something using CQL/BASIC search:
curl '127.0.0.1:8080?operation=searchRetrieve&queryType=cql&query=something&x-indent-response=1'
# or port 8200 if run with docker
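The same request URL can also be assembled in Java, which takes care of URL-encoding queries that contain spaces, quotes or brackets. The class SearchRequestSketch below is a hypothetical helper and not part of this repository; the parameter names are the ones from the curl call above.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical helper (not part of this repository): builds the
// searchRetrieve URL used in the curl example above.
final class SearchRequestSketch {
    static URI searchUri(String baseUrl, String queryType, String query) {
        // LinkedHashMap keeps the parameters in insertion order.
        Map<String, String> params = new LinkedHashMap<>();
        params.put("operation", "searchRetrieve");
        params.put("queryType", queryType);
        params.put("query", query);
        params.put("x-indent-response", "1");
        String queryString = params.entrySet().stream()
                .map(e -> e.getKey() + "=" + URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8))
                .collect(Collectors.joining("&"));
        return URI.create(baseUrl + "?" + queryString);
    }

    public static void main(String[] args) {
        // Use port 8200 instead if the endpoint runs in Docker as described above.
        System.out.println(searchUri("http://127.0.0.1:8080", "cql", "something"));
    }
}
```

The resulting URI can be passed to any HTTP client, for example java.net.http.HttpClient on Java 11 or later.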
To debug in VSCode, add the default debug setting Attach by Process ID, then start the Jetty server with the following command and start debugging in VSCode while the server waits for the debugger to attach.
# export configuration values, see section #Configuration
MAVEN_OPTS="-Xdebug -Xnoagent -Djava.compiler=NONE -agentlib:jdwp=transport=dt_socket,server=y,address=5005" mvn jetty:run-war
There are a few basic tests in src/test/java/d.s.t.w.f.f/, with hopefully more to come.
The tests use their own custom log4j2.xml configuration file.