Hervé Déjean edited this page Nov 21, 2019 · 14 revisions


Reference Documents:

https://transkribus.eu/wiki/index.php/REST_Interface

https://transkribus.eu/TrpServer/Swadl/wadl.html

Code

See TranskribusPyClient in git. The module client.py offers a subset of the server API.

In the sub-package test, there are some examples of use.

TranskribusCommands contains command line routines.

Note on the proxy settings

The proxy can be specified as usual in an environment variable (HTTPS_PROXY), or it can be passed as a parameter to the code (see the constructor).

Note on the Transkribus login

Pass your login and password as code parameters, or consider having a Transkribus_credential.py file where your login and password are stored, like below:

 # -*- coding: utf-8 -*-
 login = "[email protected]"
 password = "my-password-is-here"
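As an illustration only, the two options above (explicit parameters, or a credential file) could be combined as follows. `load_credentials` is a hypothetical helper, not a function of TranskribusPyClient; it assumes the file defines `login` and `password` exactly as in the example above.

```python
import importlib.util
from pathlib import Path


def load_credentials(path, login=None, password=None):
    """Return (login, password), preferring explicit arguments over the file.

    Hypothetical helper for illustration; not part of TranskribusPyClient.
    """
    if login is not None and password is not None:
        return login, password
    path = Path(path)
    if not path.is_file():
        raise FileNotFoundError(f"no credential file at {path}")
    # Load the file as a throwaway module, so it need not be on sys.path.
    spec = importlib.util.spec_from_file_location("_trnskrbs_credential", path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return (
        login if login is not None else module.login,
        password if password is not None else module.password,
    )
```

Explicit arguments win over the file, so a script can still override a stored login for a single call.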

Contact person: JL Meunier

Command Line Utilities

Persistent login

In order to provide your Transkribus credentials to each command, there are 3 possible ways:

  1. you create a Transkribus_credential.py module as explained in the previous section (set the access rights properly to protect your password!)
  2. you provide your Transkribus credentials at each command using the --login and --pwd options.
  3. you provide your credentials once and persist them using the --persist option. They are stored on disk with appropriate access rights, in a .trnskrbs folder.
  do_login.py --persist --login <login> --pwd <password>
  #To use the persisted session, set the --persist option in the next commands.
  #To clean the persistent session:
  do_logout.py

Collections

Add Document(s) to Collection

Command to add one or several documents to a target collection.

These objects are all specified by their unique identifier (a number).

NOTE: the documents are NOT duplicated! It is the same document that will appear in target collection in addition to some other collection(s).

  USAGE: TranskribusCommands/do_addDocToCollec.py [--persist] <colId>  [ <docId> | <docIdFrom>-<docIdTo> ]+
  Documents are specified by a space-separated list of numbers, or number ranges, e.g. 3-36.

Duplicate Document(s) from Collection to Collection

Command to duplicate one or several documents from a source collection to a target collection.

These objects are all specified by their unique identifier (a number).

NOTE 1: the new document inherits the name from the source collection.

NOTE 2: Access rights to the source collection are required.

  Usage: ./TranskribusCommands/do_duplicateDoc.py [--persist] <from_colId>  <to_colId> [ <docId> | <docIdFrom>-<docIdTo> ]+
  Documents are specified by a space-separated list of numbers, or number ranges, e.g. 3-36.

Create a Collection

Command to create one collection.

  Usage: ./TranskribusCommands/do_createCollec.py [--persist] <collection_name>

Delete a Collection

Command to delete one collection.

  Usage: ./TranskribusCommands/do_deleteCollec.py [--persist] <colId>

List a Collection

Command to list the content of a collection.

  Usage: ./TranskribusCommands/do_listCollec.py [--persist] <colId>

Managing transcripts of a document

Command to selectively list or remove the transcripts of a document, or update their status.

This command works in 3 stages:

  1. Filtering: you can look at all transcripts per page or only the last one. Then you can filter based on the page number, or the transcript date, status, or author. The command does an AND of all filters; in other words, a selected transcript satisfies all filters. After filtering, you can also keep only the last transcript per page.
  2. Checking: you can check that the transcripts selected by the filter satisfy certain conditions, based on transcript status and author. If the condition is not met for one or more selected transcript(s), the operation is not performed (apart from the 'list' operation).
  3. Acting: an operation applies to the selected transcripts; currently you can list them, remove them, or update their status.
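The three stages can be pictured with the minimal sketch below. It is not the command's actual code: the `Transcript` record, the `run` function and the operation names are assumptions made for illustration, and filters and checks are modelled as plain predicates.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Transcript:
    page: int
    status: str
    user: str


def run(transcripts, filters, checks, operation="list"):
    """Filter, check, then act; return the transcripts the operation applies to."""
    # Stage 1: a transcript is selected only if it satisfies ALL filters (AND).
    selected = [t for t in transcripts if all(f(t) for f in filters)]
    # Stage 2: every selected transcript must pass every check.
    checks_ok = all(c(t) for t in selected for c in checks)
    # Stage 3: 'list' is always allowed; any other operation needs the checks to pass.
    if operation != "list" and not checks_ok:
        raise ValueError("check failed: operation not performed")
    return selected
```

With a status filter and a failing author check, `run(..., operation="list")` still returns the selection, whereas any other operation raises and nothing is touched.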
Filtering the last transcript of each page

You can either look at the last transcript of each page, before applying any filter, or after filtering, keep only the last transcript among those filtered.

To keep only the last transcript of each page before filtering:

  --last

To keep only last transcript after filtering:

  --last_filtered
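The difference between the two options is only the order of operations, which the following sketch tries to make concrete. It assumes that "last" means the most recent transcript of a page, and it represents each transcript as a hypothetical `(page, timestamp, status)` tuple; neither detail comes from the tool itself.

```python
def keep_last_per_page(transcripts):
    """Keep, for each page, the transcript with the greatest timestamp."""
    last = {}
    for page, timestamp, status in transcripts:
        if page not in last or timestamp > last[page][1]:
            last[page] = (page, timestamp, status)
    return sorted(last.values())


def select(transcripts, keep, last=False, last_filtered=False):
    """Apply the predicate `keep`, reducing to the last transcript before or after."""
    if last:  # --last: reduce first, then filter
        transcripts = keep_last_per_page(transcripts)
    selected = [t for t in transcripts if keep(t)]
    if last_filtered:  # --last_filtered: filter first, then reduce
        selected = keep_last_per_page(selected)
    return selected
```

If page 1 has an older NEW transcript and a newer DONE one, filtering on NEW with `last=True` drops page 1 entirely, while `last_filtered=True` keeps the older NEW transcript.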
Filtering based on Page Numbers

You can provide a combination of single page numbers and intervals, for instance:

  1
  1,3
  5-7
  1,3,5-7,9
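For readers who want to build or validate such a specification in their own scripts, here is a small sketch of how it can be expanded. `parse_page_range` is a hypothetical name; the command's own parser may differ, for instance in how it treats duplicates or reversed intervals.

```python
def parse_page_range(spec):
    """Expand a spec such as '1,3,5-7,9' into a sorted list of page numbers."""
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if "-" in part:
            first, last = (int(n) for n in part.split("-", 1))
            if first > last:
                raise ValueError(f"empty interval: {part!r}")
            pages.update(range(first, last + 1))  # intervals are inclusive
        else:
            pages.add(int(part))
    return sorted(pages)
```

For example, `parse_page_range("1,3,5-7,9")` returns `[1, 3, 5, 6, 7, 9]`.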
Filtering based on Dates

You can provide zero to many time intervals, which are either closed intervals or half-bounded closed intervals. A point in time is specified according to ISO 8601, e.g. 2017-02-05T22:43:56+0200 or 2017-02-05T21:43:56Z. Note that the timezone must be specified (Z denotes the UTC timezone). By default the command shows times in the timezone of your machine, unless you specify the option --utc. NOTE: the command ignores any daylight saving time (DST) practice. For example:

  --after  2017-01-01T00:00:00Z
  --after  2017-01-01T00:00:00Z --before 2017-12-31T23:59:59Z
  --within 2017-01-01T00:00:00Z/2017-12-31T23:59:59Z
  --at     2017-01-01T00:00:00Z
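To check a date string before passing it to the command, the full date form with a timezone can be parsed with the Python standard library, as sketched here. The function names are illustrative assumptions; the sketch covers only complete dates with a timezone and says nothing about how the command itself parses its arguments.

```python
from datetime import datetime


def parse_point_in_time(text):
    """Parse 'YYYY-MM-DDThh:mm:ss' followed by Z, +HHMM or -HHMM."""
    # Since Python 3.7, %z accepts 'Z' as well as numeric offsets such as +0200.
    return datetime.strptime(text, "%Y-%m-%dT%H:%M:%S%z")


def to_epoch_millis(moment):
    """Milliseconds since 1970-01-01T00:00:00Z for a timezone-aware datetime."""
    return int(moment.timestamp() * 1000)


def within(moment, after=None, before=None):
    """Closed, or half-bounded closed, interval test: after <= moment <= before."""
    return (after is None or moment >= after) and (before is None or moment <= before)
```

Because the parsed values are timezone-aware, `2017-02-05T22:43:56+0200` and `2017-02-05T20:43:56Z` compare as equal, and a malformed date raises `ValueError` instead of being silently accepted.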
Filtering or Checking based on Status

Filter transcripts to keep only those whose status is one of the given status(es) (it is an OR) AND not any of the negated status(es), which are prefixed by '/'.

For instance, to filter transcripts with status A or B and not C, specify: --status=A --status=B --status=/C

(Double a / to escape it.)

Filter to keep only the transcripts with status NEW or IN_PROGRESS:

  --status NEW --status IN_PROGRESS

Check that the status of any selected transcript is one of those given (it is an OR).

  --check_status NEW
  --check_status NEW --check_status IN_PROGRESS

Check that the status of any selected transcript is one of those given (it is an OR) and not 'FOO'.

  --check_status NEW --check_status IN_PROGRESS --check_status /FOO
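The OR / AND-NOT rule, which applies in the same way to filters and checks, can be sketched as follows. This is an illustration, not the command's code. It makes two assumptions that the text does not settle: a value starting with a doubled `//` is read as a literal value starting with one `/`, and when only negated values are given, everything that is not rejected is kept.

```python
def split_values(values):
    """Split option values into (accepted, rejected) sets.

    A leading '/' negates a value; a leading '//' is read as a literal '/'.
    """
    accepted, rejected = set(), set()
    for value in values:
        if value.startswith("//"):
            accepted.add(value[1:])  # escaped: keep one literal '/'
        elif value.startswith("/"):
            rejected.add(value[1:])
        else:
            accepted.add(value)
    return accepted, rejected


def matches(candidate, values):
    """True if candidate is one of the accepted values and none of the rejected ones."""
    accepted, rejected = split_values(values)
    # Assumption: with no positive value given, anything not rejected is kept.
    return (not accepted or candidate in accepted) and candidate not in rejected
```

For example, `matches("NEW", ["NEW", "IN_PROGRESS", "/FOO"])` is true, while `matches("FOO", ["/FOO"])` is false.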
Filtering or Checking based on User

Filter transcripts to keep only those whose author is one of the given users (it is an OR) AND not any of the negated users, whose names are prefixed by '/'.

For instance, to filter transcripts authored by A or B and not C, specify: --user=A --user=B --user=/C

(Double a / to escape it.)

Filter to keep only transcripts authored by either Jean-Luc or Hervé:

  --user [email protected] --user [email protected]

Check that the author of any selected transcript is one of those given users (it is a OR).

  --check_user [email protected]

Filter to keep only transcripts authored by either Jean-Luc or Hervé, and not those authored by another person:

  --user [email protected] --user [email protected] --user /[email protected]
Generating a TRP file

Using the --trp option, you produce a TRP file that reflects the selected transcripts. This file can be used to download those transcripts and only those.

Operation

By default, the command does a --list. But it can remove (--rm, CAUTION!) the transcripts selected by your filter(s), if they meet all the check(s) you specified. It can also update the status of the selected transcripts with --set_status <status>.

Usage
This command first filters transcripts based on user specification, before checking user's specification on filtered transcript.
Eventually, the retrieved transcripts are listed, or removed, or their status is updated.

Page range is a comma-separated series of integer or pair of integers separated by a '-'
For instance 1  or 1,3  or 1-4 or 1,3-6,8

Date takes the form:
        YYYY-MM-DDThh:mm:ss+HHMM  like 2017-09-04T18:30:20+0100
        YYYY-MM-DDThh:mm:ss-HHMM  like 2017-09-04T18:30:20-0100
        YYYY-MM-DDThh:mm:ssZ  like 2017-09-04T18:30:20Z
    Incomplete dates are converted into the first millisecond of the given period. For instance 2017 is equivalent to 2017-01-01T00:00:00
Alternatively, it can be a timestamp (number of milliseconds since 1970-01-01)
--utc option will show UTC times



Managiong the transcripts of one or several document(s) or of a whole
collection. Pass your login/password as options otherwise consider having a
Transkribus_credential.py file, which defines a 'login' and a 'pwd' variables.
If you need to use a proxy, use the --https_proxy option or set the
environment variables HTTPS_PROXY.   To use HTTP Basic Auth with your proxy,
use the http://user:password@host/ syntax.

Options:
  --version             show program's version number and exit
  -h, --help            show this help message and exit
  -s SERVER, --server=SERVER
                        Transkribus server URL
  -l LOGIN, --login=LOGIN
                        Transkribus login (consider storing your credentials
                        in 'transkribus_credentials.py')
  -p PWD, --pwd=PWD     Transkribus password
  --persist             Try using an existing persistent session, or log-in
                        and persists the session.
  --https_proxy=HTTPS_PROXY
                        proxy, e.g. http://XXX:8000
  --last                filter (i.e. keep) only last transcript of each page
                        before any filtering occurs.
  --after=AFTER         filter (i.e. keep) transcripts created on or after
                        this date.
  --before=BEFORE       filter (i.e. keep) transcripts created on or before
                        this date.
  --within=WITHIN       filter (i.e. keep) transcripts created within this
                        range(s) of dates.
  --at=AT               filter (i.e. keep) transcripts created at a date(s).
  --user=USER           filter (i.e. keep) transcripts that were authored by
                        this or these users.
  --status=STATUS       filter (i.e. keep) transcripts that have this or these
                        status(es).
  --last_filtered       filter (i.e. keep) only last transcript, if any, of
                        each page (done after any other filter).
  --check_user=CHECK_USER
                        Check that each filtered transcript was authored by
                        one of these users.
  --check_status=CHECK_STATUS
                        Check that each filtered transcript have on of these
                        statuses.
  --utc                 Show UTC time.
  --trp=FILENAME        Store the TRP data reflecting the filtered transcripts in the given file.
  --list                List   the filtered transcripts.
  --rm                  Remove the filtered transcripts. (CAUTION)
  --set_status=SET_STATUS
                        Set the filtered transcripts' status.

Transkribus_downloader

Utility to download a full collection from Transkribus and store it in a conventional DU folder structure. PageXml XMLs (one XML file per page), and optionally the images, are downloaded.

In addition, a "multi-page" PageXml file is generated for each document. (PageXml is a single-page standard, so we changed it; see read.xml_formats.multipagecontent.xsd)

Viewing those xml files (.mpxml and .pxml) is possible using mpxml_viewer and its specific .ini configuration file. See in read.visu.

NOTE: this downloader is lazy and will download the content of a document only if the document timestamp on the server is more recent than the one on disk. On the other hand, when renewing the content of a document on disk, it downloads the whole content again (XML and images), irrespective of which page or transcript was modified.
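The lazy rule in this note amounts to a timestamp comparison per document, roughly as sketched below. The function names, the dictionaries mapping document ids to timestamps, and the `force` argument are assumptions for illustration; the downloader's real data structures are not described here.

```python
def needs_download(server_timestamp, disk_timestamp, force=False):
    """Decide whether a whole document must be fetched again."""
    if force or disk_timestamp is None:  # forced, or never downloaded before
        return True
    return server_timestamp > disk_timestamp  # the server copy is more recent


def plan_downloads(server_docs, disk_docs, force=False):
    """Return the sorted ids of the documents to download in full."""
    return sorted(
        doc_id
        for doc_id, server_timestamp in server_docs.items()
        if needs_download(server_timestamp, disk_docs.get(doc_id), force)
    )
```

The decision is made per document, never per page: once a document is considered stale, all of its XML files and images are fetched again.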

In --trp mode, the download is not lazy and does not generate a multi-page PageXml. --force in this mode will overwrite any data.





 > python C:\Local\meunier\git\DLA\src\read\TranskribusCommands\Transkribus_downloader.py 3571
  - Done
 - Downloading collection 3571 to folder .
 - creating folder: .\trnskrbs_3571
 INFO:root:- downloading collection 3571 into folder .\trnskrbs_3571\col (bForce=False)
  INFO:root:- downloading collection 3571, document 7749 into folder .\trnskrbs_3571\col\7749 (bForce=False)
  INFO:root:- DONE (downloaded collection 3571, document 7750 into folder .\trnskrbs_3571\col\7750 (bForce=False))
 INFO:root:- DONE (downloaded collection 3571 into folder .\trnskrbs_3571\col (bForce=False))
 - Done
 - Generating multi_page PageXml
  - .\trnskrbs_3571\col\7750
  - .\trnskrbs_3571\col\7750.mpxml
  - .\trnskrbs_3571\col\7749
  - .\trnskrbs_3571\col\7749.mpxml
 - Done, see in .\trnskrbs_3571

Transkribus_uploader

Utility to upload a full collection to Transkribus from a conventional DU folder structure created by the transkribus-downloader. This tool uploads the PageXml files (one XML file per page, i.e. the .pxml files).

In standard mode, the last transcript of each page is uploaded, thanks to the trp.json files stored by the downloader to guide the upload. SO KEEP THE FILES CONSISTENT!

In particular, the status of the uploaded transcript will be the same as its parent, as indicated in the trp.json file. The --trp option allows you to use another TRP file of your choice, typically one generated using do_Transcript.py command.

The --set_status option allows you to force the given status to the newly uploaded transcripts. (The status must be one of the Transkribus known statuses.)

 Upload the transcript(s) from the DS structure to Transkribus, either of the
 collection or one of its document(s).  The <directory> must have been created
 by transkribus_downloader.py and should contain the 'col' directory and a
 trp.json file for the collection, and one per document (the 'out', 'ref',
 'run', 'xml' folders are not used). The page transcript from the single page
 PageXml files are uploaded. (The multi-page xml file(s) are ignored))     Pass
 your login/password as options otherwise consider having a
 Transkribus_credential.py file, which defines a 'login' and a 'pwd' variables.
 If you need to use a proxy, use the --https_proxy option or set the
 environment variables HTTPS_PROXY.   To use HTTP Basic Auth with your proxy,
 use the http://user:password@host/ syntax.
 
 Options:
  --version             show program's version number and exit
  -h, --help            show this help message and exit
  -s SERVER, --server=SERVER
                        Transkribus server URL
  -l LOGIN, --login=LOGIN
                        Transkribus login (consider storing your credentials
                        in 'transkribus_credentials.py')
  -p PWD, --pwd=PWD     Transkribus password
  --persist             Try using an existing persistent session, or log-in
                        and persists the session.
  --https_proxy=HTTPS_PROXY
                        proxy, e.g. http://cornillon:8000
  -q, --quiet           Quiet mode
  --trp=TRP             download the content specified by the trp file.
  --toolname=TOOL       Set the Toolname metadata in Transkribus.
  --message=TEXT        Set the Message metadata in Transkribus.

TranskribusDU_transcriptUploader

Utility to upload the transcripts from a MultiPageXml XML file to a Transkribus collection. The MultiPageXml is split into PageXml single-page transcripts, which are then uploaded to become a new version of the transcript for each page of the document(s).

This utility expects to work on complete documents from the folders created by Transkribus_downloader.py.

Beware: do not use the --trp option at download, because uploading the multi-page XML will then overlay any pages in the .mpxml starting from page 1.

  Usage: Transkribus_transcriptUploader.py  [--toolname] [--message] <directory>   <colid>   [<docid>]

LA (Layout Analysis)

analyze the Layout

Analyze the layout of a page by batch method. Currently, this is a pre-requisite before doing HTR.

  usage : do_analyseLayout.py <colid> <docid> [<pages>] <donotblockseq> <donotlineseg>
  by default blocks and lines seg are performed. Indicate (0=False/1=True) which actions will be performed
  Display the job ID
  do_analyseLayout.py  1949 10462 4
  35531
  - Done

analyze the Layout New (URO baseline Finder)

Analyze the layout (create lines and regions) of a doc/pages. Currently, this is a pre-requisite before doing HTR.

  usage : do_analyseLayoutNew.py <colid>
  -r REGION, --region=REGION
                        apply Layout Analysis (textLine)
  --trp=TRP_DOC         use trp doc file
  --docid=DOCID         document/pages to be analyzed
  --doRegionSeg         do Region detection
  --batchjob            do one job per page
  Display the job ID
  do_analyseLayoutNew.py  1949 --docid=10462/4
  35531
  - Done

analyze the Layout (batch)

Analyze the layout of a page by batch method. Currently, this is a pre-requisite before doing HTR.

  usage : do_analyseLayoutBatch.py <colid> <docid> [<pages>] <donotblockseq> <donotlineseg>
  by default blocks and lines seg are performed. Indicate (0=False/1=True) which actions will be performed
  Display the job ID
  do_analyseLayoutBatch.py  1949 10462 4
  35531
  - Done

Table Template Matching

Usage: do_tableTemplate.py <colid> <docid/pagerange>





  do_tableTemplate.py --templateID 6163718 27593 103626                                          
  job ID:['435759']
  - Done

Recognition

list the HTR HMM Models

List the HTR HMM models:

    $ do_listHtrHmm.py
    - Done

apply an HTR HMM Model

Apply an HTR HMM model onto the pages of a document: do_htrHmm.py

  usage : do_htrHmm.py <model-name> <colid> <docid> [<pages>]
  do_htr.py Wydemann 3829 8620 1
  - Done
  35313

list the HTR RNN Models and Dictionaries

List the names of the RNN HTR models and the names of the dictionaries. do_listHtrRnn.py

  --- Models ---------------------------
  20160408_htrts_midfinal_11.sprnn
  meganet_hist_01_crx.sprnn
  meganet_us1900_05_us_an_crx.sprnn
  meganet_usaddr_12_pp_crx.sprnn
  net_160201_trained_noise.sprnn
  net_fraktur_0000_2000.sprnn
  South_Carolina_1720.sprnn
  GEO_1-3.sprnn
  GEO_1-3_v2.sprnn
  Reichsgericht_v4.sprnn
  IO_Botany_v1.sprnn
  Resolutions_v1.sprnn
  Bozen.sprnn
  escher_v3.sprnn
  Frisch-Sklaverei.sprnn
  IO_Botany_v2.sprnn
  Konzilsprotokolle_v1.sprnn
  hervetest.sprnn.sprnn
  StAZH_v1.sprnn
  Cyrillic_20th_Century.sprnn
  Hyde_Reel_1_Session_2.sprnn
  Gothic_Letter_1622.sprnn
  NB_Norway_Koren.sprnn
  Egypt_diary.sprnn
  Sutor.sprnn
  
  --- Dictionaries ---------------------
  alvermann_train.dict
  deutsch.dict
  deutscheNachnamen.dict
  eng.dict
  fracture.dict
  frau.dict
  htrts15_all_sorted.dict
  mann.dict
  Bozen_v1.dict
  Resolutions_v1.dict
  Reichsgericht_v1.dict
  StAZH_v1.dict
  Cyrillic_20th_Century.dict
  Gothic_Letter_1622.dict
  NB_Norway_Koren.dict
  
  - Done
 New version using modelID: do_listHtrRnn.py --colid=COLID
 $ python ../../../Local/TranskribusPyClient/src/TranskribusCommands/do_listHtrRnn.py  --colid=6722
 385     ABP KWS Test    #Wed Jun 07 17:57:19 CEST 2017
 Learning\ Rate=2e-3
 Nr.\ of\ Epochs=200
 Train\ Size\ per\ Epoch=1000
 Noise=both
 133     English Writing M1      no params

Train an HTR RNN Model

Train an HTR (RNN) model for a given collection. The training set and test set are selected using the --trdoc and --tsdoc options.

do_htrTrainRnn.py

 do_htrTrainRnn.py <MODELNAME> <COLID> --trdoc=DOCID/PAGERANGE (--trdoc=...) --tsdoc=DOCID/PAGERANGE (--tsdoc=...)  --epoch=N[,N] --batch=N[,N] --lr=F[,F]


When two comma-separated values are given for each of --epoch, --batch and --lr, the command creates 8 jobs, one per combination, with the following parameters:

 2e-3 200 200
 2e-3 500 200
 5e-3 200 200
 5e-3 500 200
 2e-3 200 400
 2e-3 500 400
 5e-3 200 400
 5e-3 500 400
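The eight lines above are the Cartesian product of three two-valued lists, which the sketch below reproduces in the same order. Reading the first column as the learning rate is an assumption based on its values; which of the two remaining columns corresponds to --epoch and which to --batch is not stated, so they are left unnamed.

```python
from itertools import product

first = ["2e-3", "5e-3"]  # assumed to be the learning rates (--lr)
second = ["200", "500"]   # second column of the listing
third = ["200", "400"]    # third column of the listing

# In the listing the third column changes slowest and the second fastest,
# so product() is given the lists in that order and each tuple is re-ordered.
jobs = [(a, b, c) for c, a, b in product(third, first, second)]

for job in jobs:
    print(*job)
```

The number of jobs is the product of the list lengths, so adding a third value to one option would give 12 jobs instead of 8.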

apply an HTR RNN Model

Apply an HTR RNN model and dictionary onto the pages of a document: do_htrRnn.py. Update: htrRnnDecode now works at transcript and regions id level. With this command line, the last transcript of the page is taken if docid option is used. With --trp, the transcript referred in the trp file is used.

  • It now uses tempDict dictionaries; I will add an option for using common dictionaries.




   $ do_htrRnn.py 899 abp_family.dict 5400 --docid=17442/1-2
   35442
   - Done
   $ do_htrRnn.py 899 abp_family.dict 5400 --trp=5400_17442_1.trp
   35443
   -Done
upload private 'temp' dictionaries

Upload a dictionary in your tmpDict folder (see your ftp folder at ftp://transbribus.eu). If several files are mentioned (several -d), they are concatenated. Delimiters are replaced by the ',' [comma] delimiter. The result can be seen in your tempDict folder. The new dictionary is named <dictionary-name>.

   usage: do_uploadDictionary.py  <dictionary-name> -d <dictionary-file>
Get status of current job

In order to check the status of a Transkribus job, you can use do_getRnnTrainingJobStatus.py. Note that this only allows you to query your own jobs.

   usage : do_getRnnTrainingJobStatus.py <jobid>
   $ do_getRnnTrainingJobStatus.py 35442
   "Done"
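Since a job status has to be queried repeatedly until the job finishes, a script may wrap the query in a polling loop such as the sketch below. `wait_until_done` and its `get_status` callback are hypothetical: the callback stands for whatever returns the status text of a job (for instance by running do_getRnnTrainingJobStatus.py), and only the "Done" value shown above is assumed.

```python
import time


def wait_until_done(get_status, job_id, interval=30.0, max_polls=120):
    """Poll get_status(job_id) until it reports Done; return the number of polls."""
    for attempt in range(1, max_polls + 1):
        # The status may come back surrounded by double quotes, as in "Done".
        status = get_status(job_id).strip().strip('"')
        if status == "Done":
            return attempt
        time.sleep(interval)
    raise TimeoutError(f"job {job_id} not done after {max_polls} polls")
```

The `max_polls` bound keeps the loop from running forever if a job never reaches the Done status.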