Hervé Déjean edited this page Jul 6, 2017 · 14 revisions

Welcome to the TranskribusPyClient wiki!

Reference Documents:

https://transkribus.eu/wiki/index.php/REST_Interface

https://transkribus.eu/TrpServer/Swadl/wadl.html

Code

See DLA/src/read/TranskribusPyClient in git. The module client.py offers a subset of the server API.

The test sub-package contains some usage examples.

DLA/src/read/TranskribusCommands contains command line routines.

Note on the proxy settings

The proxy can be set as usual through an environment variable (HTTPS_PROXY), or it can be passed as a parameter to the code (see the constructor).
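The two ways of setting the proxy can be combined in a small helper. This is only a sketch: the function names are hypothetical, the rule that an explicit parameter wins over the environment variable is an assumption, and the actual parameter name expected by the client constructor should be checked in client.py.

```python
import os


def resolve_https_proxy(explicit_proxy=None, environ=None):
    """Return the HTTPS proxy URL to use, or None if no proxy is configured.

    Assumption: an explicitly passed value wins over the environment.
    """
    if environ is None:
        environ = os.environ
    if explicit_proxy:
        return explicit_proxy
    # The variable is commonly written in upper or lower case.
    return environ.get("HTTPS_PROXY") or environ.get("https_proxy") or None


def as_proxies_dict(proxy_url):
    """Build a {'https': url} mapping, the shape used by the requests library."""
    return {"https": proxy_url} if proxy_url else {}
```

For example, `as_proxies_dict(resolve_https_proxy())` gives an empty mapping when no proxy is configured.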

on Transkribus Login

Pass your login/password as code parameters. Or consider having a Transkribus_credential.py file, where your login and password are stored, like below:

 # -*- coding: utf-8 -*-
 login = "[email protected]"
 password = "my-password-is-here"
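A script can then try to import that module and fall back to other means when it is absent. The sketch below is an illustration, not the code used by the commands: `load_credentials` is a hypothetical helper, and it assumes the module is importable (on the Python path) and defines the two variables shown above.

```python
import importlib


def load_credentials(module_name="Transkribus_credential"):
    """Return (login, password) from the credentials module, or None if unavailable."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        # No such module on the Python path: the caller must ask for credentials.
        return None
    login = getattr(module, "login", None)
    password = getattr(module, "password", None)
    if login is None or password is None:
        return None
    return login, password
```

When the function returns None, the caller can fall back to credentials passed as code parameters.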

Contact person: JL Meunier

Command Line Utilities

Persistent login

In order to provide your Transkribus credentials to each command, there are 3 possible ways:

  1. you create a Transkribus_credential.py module as explained in the previous section (set the access rights properly to protect your password!)
  2. you provide your Transkribus credentials at each command using the --login and --pwd options.
  3. you provide your credentials once and persist them using the --persist option. They are stored on disk with appropriate access rights, in a .trnskrbs folder.
  do_login.py --persist --login <login> --pwd <password>
  #To use the persisted session, set the '''--persist''' option in next commands.
  #To clean the persistent session:
  do_logout.py
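If you keep a credentials module on disk, restricting it to its owner is one way to set the access rights properly. The sketch below uses hypothetical helper names and assumes POSIX permission semantics (on Windows, os.chmod can only toggle the read-only flag). It does not describe how the --persist option itself stores its data.

```python
import os
import stat


def restrict_to_owner(path):
    """Make `path` readable and writable by its owner only (mode 0o600)."""
    os.chmod(path, stat.S_IRUSR | stat.S_IWUSR)


def is_owner_only(path):
    """Return True if neither group nor others have any permission on `path`."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    return mode & (stat.S_IRWXG | stat.S_IRWXO) == 0
```

For example, call `restrict_to_owner("Transkribus_credential.py")` once after creating the file.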

Collections

Add Document(s) to Collection

Command to add one or several documents to a target collection.

These objects are all specified by their unique identifier (a number).

NOTE: the documents are NOT duplicated! The same document will appear in the target collection in addition to the other collection(s) it already belongs to.

  USAGE: '''TranskribusCommands/do_addDocToCollec.py''' [--persist] <colId>  [ <docId> | <docIdFrom>-<docIdTo> ]+
  Documents are specified by a space-separated list of numbers, or number ranges, e.g. 3-36.
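The document specification can be expanded into a flat list of identifiers as sketched below. `parse_doc_ids` is a hypothetical helper, not the parser used by the command, and it assumes that a range such as 3-36 includes both bounds.

```python
def parse_doc_ids(args):
    """Expand arguments such as ["3-5", "12"] into [3, 4, 5, 12]."""
    doc_ids = []
    for arg in args:
        if "-" in arg:
            first, last = arg.split("-", 1)
            first, last = int(first), int(last)
            if first > last:
                raise ValueError("invalid range: %s" % arg)
            # Assumption: both ends of the range are included.
            doc_ids.extend(range(first, last + 1))
        else:
            doc_ids.append(int(arg))
    return doc_ids
```

Under that assumption, `parse_doc_ids(["3-5", "12"])` returns `[3, 4, 5, 12]`.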

Duplicate Document(s) from Collection to Collection

Command to duplicate one or several documents from a source collection to a target collection.

These objects are all specified by their unique identifier (a number).

NOTE 1: the new document inherits the name from the source collection.

NOTE 2: Access rights to the source collection are required.

  Usage: '''./TranskribusCommands/do_duplicateDoc.py''' [--persist] <from_colId>  <to_colId> [ <docId> | <docIdFrom>-<docIdTo> ]+
  Documents are specified by a space-separated list of numbers, or number ranges, e.g. 3-36.

Create a Collection

Command to create one collection.

  Usage: '''./TranskribusCommands/do_createCollec.py''' [--persist] <collection_name>

Delete a Collection

Command to delete one collection.

  Usage: '''./TranskribusCommands/do_deleteCollec.py''' [--persist] <colId>

List a Collection

Command to list the content of a collection.

  Usage: '''./TranskribusCommands/do_listCollec.py''' [--persist] <colId>

Transkribus_downloader

Utility to download a full collection from Transkribus and store it in a conventional DS folder structure. PageXml XMLs, and optionally the images, are downloaded.

In addition a "multi-page" PageXml file is generated for each document. (PageXml is a single page standard, so we changed it, see in: read.xml_formats.multipagecontent.xsd)

Viewing those xml files (.mpxml and .pxml) is possible using wxvisu and its specific wxvisu_PageXml.ini configuration file. See in read.visu.

NOTE: this downloader is lazy and will download the content of a document only if the document timestamp on the server is more recent than the one on disk. On the other hand, when renewing the content of a document on disk, it downloads the whole content again (xml and images), irrespective of which page or transcript was modified.
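The lazy behaviour comes down to a timestamp comparison per document. The sketch below only illustrates that rule: `needs_download` is a hypothetical function, and treating a document that is absent from disk (timestamp None) as needing a download is an assumption.

```python
def needs_download(server_timestamp, disk_timestamp):
    """Return True if the whole document must be (re)downloaded.

    `disk_timestamp` is None when the document was never downloaded.
    """
    if disk_timestamp is None:
        return True
    # Strictly more recent on the server: refresh the entire document.
    return server_timestamp > disk_timestamp
```

Because the decision is taken per document, a single modified page still triggers a download of all pages and images of that document.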

 (C:\Anaconda2) python Transkribus_downloader.py --help
 Usage: '''Transkribus_downloader.py <colid> [<directory>]'''
 Extract a collection from transkribus and create a DS test structure containing that collection
 Options:
 --version show program's version number and exit
 -h, --help show this help message and exit
 -s SERVER, --server=SERVER Transkribus server URL
 -f, --force Force rewrite if disk data is obsolete
 -l LOGIN, --login=LOGIN Transkribus login (consider storing your credentials in 'transkribus_credentials.py')
 -p PWD, --pwd=PWD Transkribus password
 --https_proxy=HTTPS_PROXY proxy, e.g. http://cornillon:8000
 > python C:\Local\meunier\git\DLA\src\read\TranskribusCommands\Transkribus_downloader.py 3571
  - Done
 - Downloading collection 3571 to folder .
 - creating folder: .\trnskrbs_3571
 INFO:root:- downloading collection 3571 into folder .\trnskrbs_3571\col (bForce=False)
  INFO:root:- downloading collection 3571, document 7749 into folder .\trnskrbs_3571\col\7749 (bForce=False)
  INFO:root:- DONE (downloaded collection 3571, document 7750 into folder .\trnskrbs_3571\col\7750 (bForce=False))
 INFO:root:- DONE (downloaded collection 3571 into folder .\trnskrbs_3571\col (bForce=False))
 - Done
 - Generating multi_page PageXml
  - .\trnskrbs_3571\col\7750
  - .\trnskrbs_3571\col\7750.mpxml
  - .\trnskrbs_3571\col\7749
  - .\trnskrbs_3571\col\7749.mpxml
 - Done, see in .\trnskrbs_3571

TranskribusDU_transcriptUploader

Utility to upload the transcripts from a MultiPageXml XML file to a Transkribus collection. The MultiPageXml is split into PageXml single-page transcripts, which are then uploaded to become a new version of the transcript for each page of the document(s).

This utility expects to work from the folders created by Transkribus_downloader.py .

  Usage: '''Transkribus_transcriptUploader.py   <directory>   <colId>   [<docId>]'''

LA (Layout Analysis)

analyze the Layout

Analyze the layout of a page. Currently, this is a pre-requisite before doing HTR.

  usage : '''do_analyseLayout.py <colId> <docId> [<pages>] <doNotBlockSeq> <doNotLineSeg>'''
  '''By default, both block and line segmentation are performed. Indicate (0=False/1=True) which actions will be performed.'''
  The command displays the job ID:
  do_analyseLayout.py  1949 10462 4
  35531
  - Done
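When calling such a command from a script, the job ID can be picked out of the captured output. This is a sketch based only on the sample output above: `extract_job_id` is a hypothetical helper, and it assumes the job ID is the only line made entirely of digits.

```python
def extract_job_id(output):
    """Return the first line of `output` made only of digits, as an int, else None."""
    for line in output.splitlines():
        text = line.strip()
        if text.isdigit():
            return int(text)
    return None
```

The helper does not rely on whether the ID is printed before or after the "- Done" line.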

Recognition

list the HTR HMM Models

List the HTR HMM models:

    - Done

apply an HTR HMM Model

Apply an HTR HMM model to the pages of a document: do_htrHmm.py

  usage : '''do_htrHmm.py <model-name> <colId> <docId> [<pages>]'''
  do_htrHmm.py Wydemann 3829 8620 1
  - Done
  35313

list the HTR RNN Models and Dictionaries

List the names of the RNN HTR models and the names of the dictionaries: do_listHtrRnn.py

  --- Models ---------------------------
  20160408_htrts_midfinal_11.sprnn
  meganet_hist_01_crx.sprnn
  meganet_us1900_05_us_an_crx.sprnn
  meganet_usaddr_12_pp_crx.sprnn
  net_160201_trained_noise.sprnn
  net_fraktur_0000_2000.sprnn
  South_Carolina_1720.sprnn
  GEO_1-3.sprnn
  GEO_1-3_v2.sprnn
  Reichsgericht_v4.sprnn
  IO_Botany_v1.sprnn
  Resolutions_v1.sprnn
  Bozen.sprnn
  escher_v3.sprnn
  Frisch-Sklaverei.sprnn
  IO_Botany_v2.sprnn
  Konzilsprotokolle_v1.sprnn
  hervetest.sprnn.sprnn
  StAZH_v1.sprnn
  Cyrillic_20th_Century.sprnn
  Hyde_Reel_1_Session_2.sprnn
  Gothic_Letter_1622.sprnn
  NB_Norway_Koren.sprnn
  Egypt_diary.sprnn
  Sutor.sprnn
  
  --- Dictionaries ---------------------
  alvermann_train.dict
  deutsch.dict
  deutscheNachnamen.dict
  eng.dict
  fracture.dict
  frau.dict
  htrts15_all_sorted.dict
  mann.dict
  Bozen_v1.dict
  Resolutions_v1.dict
  Reichsgericht_v1.dict
  StAZH_v1.dict
  Cyrillic_20th_Century.dict
  Gothic_Letter_1622.dict
  NB_Norway_Koren.dict
  
  - Done
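A listing in this form can be turned into two Python lists, for instance to check that a model or dictionary name exists before launching a recognition job. The sketch below assumes the exact layout shown above (section headers starting with "---" and a final "- Done" line); `parse_listing` is a hypothetical helper.

```python
def parse_listing(output):
    """Split the listing text into {'Models': [...], 'Dictionaries': [...]}."""
    sections = {}
    current = None
    for line in output.splitlines():
        text = line.strip()
        if not text:
            continue
        if text.startswith("---"):
            # Header line such as "--- Models ---------------------------"
            current = text.strip("- ").strip()
            sections[current] = []
        elif text == "- Done":
            current = None
        elif current is not None:
            sections[current].append(text)
    return sections
```

With the output above, `"StAZH_v1.sprnn" in parse_listing(output)["Models"]` would be true.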

apply an HTR RNN Model

Apply an HTR RNN model and dictionary onto the pages of a document: do_htrRnn.py

   usage : '''do_htrRnn.py <model-name> <dictionary-name> <colId> <docId> [<pages>]'''
   $ do_htrRnn.py StAZH_v1.sprnn StAZH_v1.dict 3829 8620 3-4
   35442
   - Done

(Non-Urgent) Questions

Locking

The API includes lock-related methods (listPageLocks, lockPage(<bool>), isPageLocked). Should we lock a page we work on? Like when a DU tool does automatic annotation?

Page Status

If a DU tool does automatic annotation, which status should we set? For now, we don't set it, and it becomes "in progress", which is OK.

Storing Data

If we generate a ML model for a collection, can we store it somewhere in Transkribus and associate it with the collection?
