Home
Welcome to the TranskribusPyClient wiki!
https://transkribus.eu/wiki/index.php/REST_Interface
https://transkribus.eu/TrpServer/Swadl/wadl.html
See in git DLA/src/read/TranskribusPyClient . The module client.py offers a subset of the server API.
The sub-package test contains some usage examples.
DLA/src/read/TranskribusCommands contains command line routines.
The proxy can be specified as usual through an environment variable (HTTPS_PROXY), or passed as a parameter to the code (see the constructor).
Pass your login and password as code parameters, or consider having a Transkribus_credential.py file where your login and password are stored, like below:
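As an illustration of the two ways of giving the proxy, here is a minimal sketch of how a caller might choose between an explicit parameter and the environment. The function name `resolve_https_proxy` and the precedence (explicit parameter first) are assumptions made for this example, not the behaviour of client.py.

```python
import os


def resolve_https_proxy(explicit_proxy=None, environ=None):
    """Pick the HTTPS proxy: explicit parameter first, then the environment.

    Hypothetical helper; the precedence order is an assumption.
    """
    environ = os.environ if environ is None else environ
    if explicit_proxy:
        return explicit_proxy
    # Many tools also honour the lowercase spelling, so check both.
    return environ.get("HTTPS_PROXY") or environ.get("https_proxy")


if __name__ == "__main__":
    print(resolve_https_proxy())
```

The returned value, if any, could then be handed to the client constructor.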
# -*- coding: utf-8 -*-
login = "[email protected]"
password = "my-password-is-here"
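A script could fall back on such a credential module when no login/password is passed explicitly. The sketch below shows one possible way to do it; `load_credentials` is a hypothetical helper written for this page, and it only assumes that the module defines `login` and `password` as shown above.

```python
import importlib


def load_credentials(login=None, password=None, module_name="Transkribus_credential"):
    """Return (login, password), preferring explicit values.

    Hypothetical helper: falls back on a credential module that
    defines the two variables 'login' and 'password'.
    """
    if login and password:
        return login, password
    try:
        cred = importlib.import_module(module_name)
    except ImportError:
        raise ValueError(
            "No credentials given and no %s.py module found" % module_name
        )
    return cred.login, cred.password
```

The credential module must be importable, for instance by sitting in the current directory or on the PYTHONPATH.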
Contact person: JL Meunier
In order to provide your Transkribus credentials to each command, there are 3 possible ways:
- you create a Transkribus_credential.py module as explained in the previous section (set the access rights properly to protect your password!)
- you provide your Transkribus credentials at each command using the --login and --pwd options.
- you provide your credentials once and persist them using the --persist option. They are stored on disk with appropriate access rights, in a .trnskrbs folder.
do_login.py --persist --login <login> --pwd <password>
# To use the persisted session, set the '''--persist''' option in next commands.
# To clean the persistent session:
do_logout.py
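To give an idea of what storing a session "with appropriate access rights" can look like, here is a minimal sketch that writes a file readable by its owner only. The file name `session.json`, the JSON format and both function names are assumptions for the example; the actual commands may store the session differently.

```python
import json
import os
import stat


def persist_session(session_data, folder=".trnskrbs", filename="session.json"):
    """Write session data to disk, readable and writable by the owner only.

    Hypothetical sketch: file name and format are assumptions.
    """
    os.makedirs(folder, mode=0o700, exist_ok=True)
    path = os.path.join(folder, filename)
    # Create the file with mode 0o600 so that no other user can read it.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "w") as f:
        json.dump(session_data, f)
    return path


def clear_session(folder=".trnskrbs", filename="session.json"):
    """Remove the persisted session file, if there is one."""
    path = os.path.join(folder, filename)
    if os.path.exists(path):
        os.remove(path)
```

Note that the permission bits only take effect when the file is created, and that they are not enforced in the same way on Windows.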
Command to add one or several documents to a target collection.
These objects are all specified by their unique identifier (a number).
NOTE: the documents are NOT duplicated! It is the same document that will appear in target collection in addition to some other collection(s).
USAGE: '''TranskribusCommands/do_addDocToCollec.py''' [--persist] <colId> [ <docId> | <docIdFrom>-<docIdTo> ]+
Documents are specified by a space-separated list of numbers, or number ranges, e.g. 3-36.
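The document specification above (numbers and number ranges) can be expanded into a plain list of identifiers. The following is a minimal sketch, assuming that ranges are inclusive on both ends; `parse_doc_ids` is a name chosen for this example and is not taken from the command's source.

```python
def parse_doc_ids(tokens):
    """Expand tokens such as ["7", "3-36"] into a list of document ids.

    Hypothetical helper; ranges are assumed to be inclusive.
    """
    doc_ids = []
    for token in tokens:
        if "-" in token:
            first, last = token.split("-", 1)
            first, last = int(first), int(last)
            if first > last:
                raise ValueError("Invalid range: %s" % token)
            doc_ids.extend(range(first, last + 1))
        else:
            doc_ids.append(int(token))
    return doc_ids


if __name__ == "__main__":
    print(parse_doc_ids(["7", "3-6"]))  # prints [7, 3, 4, 5, 6]
```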
Command to duplicate one or several documents from a source collection to a target collection.
These objects are all specified by their unique identifier (a number).
NOTE 1: the new document inherits the name from the source collection.
NOTE 2: Access rights in the source collection are required.
Usage: '''./TranskribusCommands/do_duplicateDoc.py''' [--persist] <from_colId> <to_colId> [ <docId> | <docIdFrom>-<docIdTo> ]+
Documents are specified by a space-separated list of numbers, or number ranges, e.g. 3-36.
Command to create one collection.
Usage: '''./TranskribusCommands/do_createCollec.py''' [--persist] <collection_name>
Command to delete one collection.
Usage: '''./TranskribusCommands/do_deleteCollec.py''' [--persist] <colId>
Command to list the content of a collection.
Usage: '''./TranskribusCommands/do_listCollec.py''' [--persist] <colId>
Utility to download a full collection from Transkribus and store it in a conventional DS folder structure. PageXml XMLs, and optionally the images, are downloaded.
In addition a "multi-page" PageXml file is generated for each document. (PageXml is a single page standard, so we changed it, see in: read.xml_formats.multipagecontent.xsd)
Viewing those xml files (.mpxml and .pxml) is possible using wxvisu and its specific wxvisu_PageXml.ini configuration file. See in read.visu.
NOTE: this downloader is lazy and will download the content of a document only if the document timestamp on the server is more recent than the one on disk. On the other hand, when renewing the content of a document on disk, it downloads the whole content again (xml and images), irrespective of which page or transcript was modified.
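The lazy behaviour described in this note boils down to a timestamp comparison per document. Here is a minimal sketch of that decision, assuming both timestamps are numbers in the same unit and that a document absent from disk is always downloaded; the function name is hypothetical and the real downloader may implement the check differently.

```python
def needs_download(server_timestamp, disk_timestamp):
    """Tell whether a document should be (re)downloaded.

    Hypothetical sketch: disk_timestamp is None when the document
    has never been downloaded.
    """
    if disk_timestamp is None:
        return True
    # Download again only if the server copy is strictly more recent.
    return server_timestamp > disk_timestamp
```

When this returns True, the whole document (all transcripts, and the images if requested) is fetched again, since the comparison is done at document level and not per page.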
(C:\Anaconda2) python Transkribus_downloader.py --help
Usage: '''Transkribus_downloader.py <colid> [<directory>]'''
Extract a collection from transkribus and create a DS test structure containing that collection
Options:
  --version                   show program's version number and exit
  -h, --help                  show this help message and exit
  -s SERVER, --server=SERVER  Transkribus server URL
  -f, --force                 Force rewrite if disk data is obsolete
  -l LOGIN, --login=LOGIN     Transkribus login (consider storing your credentials in 'transkribus_credentials.py')
  -p PWD, --pwd=PWD           Transkribus password
  --https_proxy=HTTPS_PROXY   proxy, e.g. http://cornillon:8000
> python C:\Local\meunier\git\DLA\src\read\TranskribusCommands\Transkribus_downloader.py 3571
- Done
- Downloading collection 3571 to folder .
- creating folder: .\trnskrbs_3571
INFO:root:- downloading collection 3571 into folder .\trnskrbs_3571\col (bForce=False)
INFO:root:- downloading collection 3571, document 7749 into folder .\trnskrbs_3571\col\7749 (bForce=False)
INFO:root:- DONE (downloaded collection 3571, document 7750 into folder .\trnskrbs_3571\col\7750 (bForce=False))
INFO:root:- DONE (downloaded collection 3571 into folder .\trnskrbs_3571\col (bForce=False))
- Done
- Generating multi_page PageXml
- .\trnskrbs_3571\col\7750
- .\trnskrbs_3571\col\7750.mpxml
- .\trnskrbs_3571\col\7749
- .\trnskrbs_3571\col\7749.mpxml
- Done, see in .\trnskrbs_3571
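After a download, the multi-page files sit next to the per-document folders, under the col sub-folder, as the log above shows. A minimal sketch to enumerate them follows; the helper name is made up for this example, and the layout it relies on (trnskrbs_<colId>\col\<docId>.mpxml) is only inferred from that log.

```python
from pathlib import Path


def list_multipage_files(download_dir):
    """Return the sorted .mpxml files found in <download_dir>/col.

    Hypothetical helper; the folder layout is inferred from the
    downloader log, not from its source code.
    """
    col_dir = Path(download_dir) / "col"
    return sorted(col_dir.glob("*.mpxml"))


if __name__ == "__main__":
    for mpxml in list_multipage_files("trnskrbs_3571"):
        print(mpxml)
```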
Utility to upload the transcripts from a MultiPageXml XML file to a Transkribus collection. The MultiPageXml is split into PageXml single-page transcripts, which are then uploaded to become a new version of the transcript for each page of the document(s).
This utility expects to work from the folders created by Transkribus_downloader.py .
Usage: '''Transkribus_transcriptUploader.py <directory> <colId> [<docId>]'''
Analyze the layout of a page. Currently, this is a pre-requisite before doing HTR.
usage : '''do_analyseLayout.py <colId> <docId> [<pages>] <doNotBlockSeq> <doNotLineSeg>'''
By default, both block and line segmentation are performed. Indicate (0=False/1=True) which actions will be performed.
The command displays the job ID.
do_analyseLayout.py 1949 10462 4
35531
- Done
List the HTR HMM models:
- Done
Apply an HTR HMM model onto the pages of a document: do_htrHmm.py
usage : '''do_htrHmm.py <model-name> <colId> <docId> [<pages>]'''
do_htr.py Wydemann 3829 8620 1
- Done
35313
List the names of the RNN HTR models and the names of the dictionaries: do_listHtrRnn.py
--- Models ---------------------------
20160408_htrts_midfinal_11.sprnn
meganet_hist_01_crx.sprnn
meganet_us1900_05_us_an_crx.sprnn
meganet_usaddr_12_pp_crx.sprnn
net_160201_trained_noise.sprnn
net_fraktur_0000_2000.sprnn
South_Carolina_1720.sprnn
GEO_1-3.sprnn
GEO_1-3_v2.sprnn
Reichsgericht_v4.sprnn
IO_Botany_v1.sprnn
Resolutions_v1.sprnn
Bozen.sprnn
escher_v3.sprnn
Frisch-Sklaverei.sprnn
IO_Botany_v2.sprnn
Konzilsprotokolle_v1.sprnn
hervetest.sprnn.sprnn
StAZH_v1.sprnn
Cyrillic_20th_Century.sprnn
Hyde_Reel_1_Session_2.sprnn
Gothic_Letter_1622.sprnn
NB_Norway_Koren.sprnn
Egypt_diary.sprnn
Sutor.sprnn
--- Dictionaries ---------------------
alvermann_train.dict
deutsch.dict
deutscheNachnamen.dict
eng.dict
fracture.dict
frau.dict
htrts15_all_sorted.dict
mann.dict
Bozen_v1.dict
Resolutions_v1.dict
Reichsgericht_v1.dict
StAZH_v1.dict
Cyrillic_20th_Century.dict
Gothic_Letter_1622.dict
NB_Norway_Koren.dict
- Done
Apply an HTR RNN model and dictionary onto the pages of a document: do_htrRnn.py
usage : '''do_htrRnn.py <model-name> <dictionary-name> <colId> <docId> [<pages>]'''
$ do_htrRnn.py StAZH_v1.sprnn StAZH_v1.dict 3829 8620 3-4
35442
- Done
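When the command is driven from another Python script, the argument list can be assembled following the usage line above. This is only a sketch: `htr_rnn_command` is a hypothetical helper, and it assumes that do_htrRnn.py can be found at the given script path.

```python
import sys


def htr_rnn_command(model, dictionary, col_id, doc_id, pages=None,
                    script="do_htrRnn.py"):
    """Build the argument list for do_htrRnn.py, following its usage line.

    Hypothetical helper; 'script' is the assumed path to the command.
    """
    cmd = [sys.executable, script, model, dictionary, str(col_id), str(doc_id)]
    if pages:
        # Optional page specification, e.g. "3-4".
        cmd.append(str(pages))
    return cmd


if __name__ == "__main__":
    print(htr_rnn_command("StAZH_v1.sprnn", "StAZH_v1.dict", 3829, 8620, "3-4"))
```

The resulting list can be passed to subprocess.run, which avoids any shell quoting issue with model or dictionary names.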
The API includes lock-related methods (listPageLocks, lockPage(<bool>), isPageLocked). Should we lock a page we work on, for instance when a DU tool does automatic annotation?
If a DU tool does automatic annotation, which status should we set? For now, we don't set it, and it becomes "in progress", which is ok BTW.
If we generate a ML model for a collection, can we store it somewhere in Transkribus and associate it with the collection?