Home
https://transkribus.eu/wiki/index.php/REST_Interface
https://transkribus.eu/TrpServer/Swadl/wadl.html
See DLA/src/read/TranskribusPyClient in git. The module client.py offers a subset of the server API.
The sub-package test contains some usage examples.
DLA/src/read/TranskribusCommands contains command line routines.
The proxy can be specified as usual through an environment variable (HTTPS_PROXY), or it can be passed as a parameter to the code (see the constructor).
Pass your login and password as code parameters, or consider having a Transkribus_credential.py file where your login and password are stored, like below:
# -*- coding: utf-8 -*-
login = "[email protected]"
password = "my-password-is-here"
Contact person: JL Meunier
In order to provide your Transkribus credentials to each command, there are 3 possible ways:
- you create a Transkribus_credential.py module as explained in the previous section (set the access rights properly to protect your password!)
- you provide your Transkribus credentials at each command using the --login and --pwd options.
- you provide your credentials once and persist them using the --persist option. They are stored on disk with appropriate access rights, in a .trnskrbs folder.
do_login.py --persist --login <login> --pwd <password>
#To use the persisted session, set the '''--persist''' option in next commands.
#To clean the persistent session:
do_logout.py
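As an illustration of what "appropriate access rights" can mean for data kept in a .trnskrbs folder, here is a minimal sketch for a POSIX system. The helper save_private and the file name session.txt are hypothetical; the actual client may store its session in a different way and format.

```python
import os
import stat
import tempfile


def save_private(folder, filename, content):
    """Write content to folder/filename with owner-only access.

    Hypothetical helper, not part of the Transkribus client.
    """
    os.makedirs(folder, exist_ok=True)
    os.chmod(folder, 0o700)  # only the owner may enter the folder
    path = os.path.join(folder, filename)
    # Create the file with mode 0o600 from the start, so that it is
    # never readable by other users, even briefly.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        f.write(content)
    os.chmod(path, 0o600)  # also covers a file that already existed
    return path


if __name__ == "__main__":
    base = tempfile.mkdtemp()
    p = save_private(os.path.join(base, ".trnskrbs"), "session.txt", "dummy")
    print(oct(stat.S_IMODE(os.stat(p).st_mode)))
```

The same precaution applies to a Transkribus_credential.py file, since it holds the password in clear text.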
Command to add one or several documents to a target collection.
These objects are all specified by their unique identifier (a number).
NOTE: the documents are NOT duplicated! It is the same document that will appear in target collection in addition to some other collection(s).
USAGE: '''TranskribusCommands/do_addDocToCollec.py''' [--persist] <colId> [ <docId> | <docIdFrom>-<docIdTo> ]+
Documents are specified by a space-separated list of numbers, or number ranges, e.g. 3-36.
Command to duplicate one or several documents from a source collection to a target collection.
These objects are all specified by their unique identifier (a number).
NOTE 1: the new document inherits the name from the source collection.
NOTE 2: Access rights in the source collection are required.
Usage: '''./TranskribusCommands/do_duplicateDoc.py''' [--persist] <from_colId> <to_colId> [ <docId> | <docIdFrom>-<docIdTo> ]+
Documents are specified by a space-separated list of numbers, or number ranges, e.g. 3-36.
Command to create one collection.
Usage: '''./TranskribusCommands/do_createCollec.py''' [--persist] <collection_name>
Command to delete one collection.
Usage: '''./TranskribusCommands/do_deleteCollec.py''' [--persist] <colId>
Command to list the content of a collection.
Usage: '''./TranskribusCommands/do_listCollec.py''' [--persist] <colId>
Command to selectively list or remove the transcripts of a document, or update their status.
This command works in 3 stages:
- Filtering: you can look at all transcripts per page or only the last one. Then you can filter based on the page number, or the transcript date, status, or author. The command does an AND of all filters; in other words, a selected transcript satisfies all filters. After filtering, you can also keep only the last transcript per page.
- Checking: you can check that the transcripts selected by the filter satisfy certain conditions, based on transcript status and author. If the condition is not met for one or more selected transcript(s), the operation is not performed (apart from the 'list' operation).
- Acting: an operation is applied to the selected transcripts. Currently you can list them, remove them or update their status.
You can either look at the last transcript of each page, before applying any filter, or after filtering, keep only the last transcript among those filtered.
To keep only the last transcript of each page before filtering:
--last
To keep only last transcript after filtering:
--last_filtered
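The difference between --last and --last_filtered can be shown with a small sketch. It only mimics the described behaviour on toy data: the dictionary keys (page, time, status) and the helper names are assumptions made for the example, not the data model of the command.

```python
def last_per_page(transcripts):
    """Keep, for each page, the transcript with the greatest time."""
    latest = {}
    for t in transcripts:
        current = latest.get(t["page"])
        if current is None or t["time"] > current["time"]:
            latest[t["page"]] = t
    return sorted(latest.values(), key=lambda t: t["page"])


def with_status(transcripts, status):
    """Keep only the transcripts having the given status."""
    return [t for t in transcripts if t["status"] == status]


transcripts = [
    {"page": 1, "time": 10, "status": "NEW"},
    {"page": 1, "time": 20, "status": "IN_PROGRESS"},
    {"page": 2, "time": 15, "status": "NEW"},
]

# Like --last: take the last transcript of each page, then filter.
last_then_filter = with_status(last_per_page(transcripts), "NEW")
# Like --last_filtered: filter first, then keep the last one that remains.
filter_then_last = last_per_page(with_status(transcripts, "NEW"))

print([(t["page"], t["time"]) for t in last_then_filter])  # [(2, 15)]
print([(t["page"], t["time"]) for t in filter_then_last])  # [(1, 10), (2, 15)]
```

With --last, page 1 disappears because its last transcript is not NEW; with --last_filtered, the older NEW transcript of page 1 is kept.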
You can provide a combination of single page numbers and intervals, for instance:
1
1,3
5-7
1,3,5-7,9
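A page specification of this kind can be expanded as in the sketch below. It assumes that intervals include both bounds; the function name parse_page_spec is hypothetical and the command's own parser may behave differently on malformed input.

```python
def parse_page_spec(spec):
    """Turn a specification such as '1,3,5-7,9' into a sorted list of pages."""
    pages = set()
    for item in spec.split(","):
        item = item.strip()
        if "-" in item:
            first, last = (int(part) for part in item.split("-", 1))
            if first > last:
                raise ValueError("invalid interval: %r" % item)
            pages.update(range(first, last + 1))  # both bounds included
        else:
            pages.add(int(item))
    return sorted(pages)


print(parse_page_spec("1,3,5-7,9"))  # [1, 3, 5, 6, 7, 9]
```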
You can provide zero to many time intervals, which are either closed intervals or half-bounded closed intervals. A point in time is specified according to ISO 8601, e.g. 2017-02-05T22:43:56+0200 or 2017-02-05T21:43:56Z. Note that the timezone must be specified (Z denotes the UTC timezone). By default the command displays times in the timezone of your machine, unless you specify the --utc option. NOTE: the command ignores any daylight saving time (DST) practice. For example:
--after 2017-01-01T00:00:00Z
--after 2017-01-01T00:00:00Z --before 2017-12-31T23:59:59Z
--within 2017-01-01T00:00:00Z/2017-12-31T23:59:59Z
--at 2017-01-01T00:00:00Z
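To make the date handling concrete, here is a minimal sketch of parsing such a point in time and testing it against a closed or half-bounded interval. It relies on the standard library only (Python 3.7 or later, where %z also accepts Z); the function names and the use of millisecond timestamps internally are assumptions for the example, not a description of the command's code.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_millis(text):
    """Parse 'YYYY-MM-DDThh:mm:ss' followed by Z, +HHMM or -HHMM.

    Returns the number of milliseconds since 1970-01-01T00:00:00Z.
    """
    moment = datetime.strptime(text, "%Y-%m-%dT%H:%M:%S%z")
    return (moment - EPOCH) // timedelta(milliseconds=1)


def within(ts, after=None, before=None):
    """Closed interval test; a missing bound leaves that side open-ended."""
    return (after is None or ts >= after) and (before is None or ts <= before)


start = to_millis("2017-01-01T00:00:00Z")
end = to_millis("2017-12-31T23:59:59Z")
print(within(to_millis("2017-02-05T22:43:56+0200"), start, end))  # True
print(within(to_millis("2016-12-31T23:59:59Z"), after=start))     # False
```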
Filter transcripts to keep only those whose status is one of the given statuses (it is an OR) AND is not any of the negated statuses, which are prefixed by '/'.
For instance, to filter transcripts with status A or B and not C, specify: --status=A --status=B --status=/C
(Double a / to escape it.)
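The OR / negation rule can be sketched as follows. Two points are assumptions of this example rather than documented behaviour: '//X' is read as the literal value '/X', and when only negated values are given, every other value is kept.

```python
def parse_specs(specs):
    """Split option values into (accepted, rejected) sets.

    '/X' negates X; '//X' stands for the literal value '/X' (assumption).
    """
    accepted, rejected = set(), set()
    for spec in specs:
        if spec.startswith("//"):
            accepted.add(spec[1:])  # escaped slash: literal '/...'
        elif spec.startswith("/"):
            rejected.add(spec[1:])
        else:
            accepted.add(spec)
    return accepted, rejected


def keep(value, specs):
    """True if value passes the filter described by the option values."""
    accepted, rejected = parse_specs(specs)
    if value in rejected:
        return False
    # Assumption: with only negated values, everything else is kept.
    return not accepted or value in accepted


# Equivalent of: --status=A --status=B --status=/C
specs = ["A", "B", "/C"]
print([s for s in ["A", "B", "C", "D"] if keep(s, specs)])  # ['A', 'B']
```

The same logic would apply to the --user filter, with user names instead of statuses.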
Filter to keep only the transcripts with status NEW or IN_PROGRESS:
--status NEW --status IN_PROGRESS
Check that the status of each selected transcript is one of those given (it is an OR).
--check_status NEW
--check_status NEW --check_status IN_PROGRESS
Check that the status of each selected transcript is one of those given (it is an OR) and not 'FOO'.
--check_status NEW --check_status IN_PROGRESS --check_status /FOO
Filter transcripts to keep only those whose author is one of the given users (it is an OR) AND is not any of the negated users, whose names are prefixed by '/'.
For instance, to filter transcripts authored by A or B and not C, specify: --user=A --user=B --user=/C
(Double a / to escape it.)
Filter to keep only transcripts authored by either Jean-Luc or Hervé:
--user [email protected] --user [email protected]
Check that the author of each selected transcript is one of the given users (it is an OR).
--check_user [email protected]
Filter to keep only transcripts authored by either Jean-Luc or Hervé, and not those authored by another person:
--user [email protected] --user [email protected] --user /[email protected]
Using the --trp option, you produce a TRP file that reflects the selected transcripts. This file can be used to download those transcripts and only those.
By default, the command does a --list. But it can remove (--rm, CAUTION!) the transcripts selected by your filter(s), if they meet all the check(s) you specified. It can also update the status of the selected transcripts: --set_status <status>.
This command first filters transcripts based on user specification, before checking user's specification on filtered transcript. Eventually, the retrieved transcripts are listed, or removed, or their status is updated.
Page range is a comma-separated series of integer or pair of integers separated by a '-'
For instance 1 or 1,3 or 1-4 or 1,3-6,8
Date takes the form:
  YYYY-MM-DDThh:mm:ss+HHMM like 2017-09-04T18:30:20+0100
  YYYY-MM-DDThh:mm:ss-HHMM like 2017-09-04T18:30:20-0100
  YYYY-MM-DDThh:mm:ssZ like 2017-09-04T18:30:20Z
Incomplete dates are converted into the first millisecond of the given period. For instance 2017 is equivalent to 2017-01-01T00:00:00
Alternatively, it can be a timestamp (number of milliseconds since 1970-01-01)
--utc option will show UTC times
Managing the transcripts of one or several document(s) or of a whole collection.
Pass your login/password as options otherwise consider having a Transkribus_credential.py file, which defines a 'login' and a 'pwd' variables.
If you need to use a proxy, use the --https_proxy option or set the environment variables HTTPS_PROXY.
To use HTTP Basic Auth with your proxy, use the http://user:password@host/ syntax.
Options:
  --version                    show program's version number and exit
  -h, --help                   show this help message and exit
  -s SERVER, --server=SERVER   Transkribus server URL
  -l LOGIN, --login=LOGIN      Transkribus login (consider storing your credentials in 'transkribus_credentials.py')
  -p PWD, --pwd=PWD            Transkribus password
  --persist                    Try using an existing persistent session, or log-in and persists the session.
  --https_proxy=HTTPS_PROXY    proxy, e.g. http://cornillon:8000
  --last                       filter (i.e. keep) only last transcript of each page before any filtering occurs.
  --after=AFTER                filter (i.e. keep) transcripts created on or after this date.
  --before=BEFORE              filter (i.e. keep) transcripts created on or before this date.
  --within=WITHIN              filter (i.e. keep) transcripts created within this range(s) of dates.
  --at=AT                      filter (i.e. keep) transcripts created at a date(s).
  --user=USER                  filter (i.e. keep) transcripts that were authored by this or these users.
  --status=STATUS              filter (i.e. keep) transcripts that have this or these status(es).
  --last_filtered              filter (i.e. keep) only last transcript, if any, of each page (done after any other filter).
  --check_user=CHECK_USER      Check that each filtered transcript was authored by one of these users.
  --check_status=CHECK_STATUS  Check that each filtered transcript has one of these statuses.
  --utc                        Show UTC time.
  --trp=FILENAME               Store the TRP data reflecting the filtered transcripts in the given file.
  --list                       List the filtered transcripts.
  --rm                         Remove the filtered transcripts. (CAUTION)
  --set_status=SET_STATUS      Set the filtered transcripts' status.
Utility to download a full collection from Transkribus and store it in a conventional DU folder structure. PageXml XMLs (one XML file per page), and optionally the images, are downloaded. A multi-page XML is created per document.
In addition a "multi-page" PageXml file is generated for each document. (PageXml is a single page standard, so we changed it, see in: read.xml_formats.multipagecontent.xsd)
Viewing those xml files (.mpxml and .pxml) is possible using mpxml_viewer and its specific .ini configuration file. See in read.visu.
NOTE: this downloader is lazy and will download the content of a document only if the document timestamp on the server is more recent than the one on disk. On the other hand, when renewing the content of a document on disk, it downloads the whole content again (xml and images), irrespective of which page or transcript was modified.
In --trp mode, the download is not lazy and does not generate a multi-page PageXml. --force in this mode will overwrite any data.
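The lazy behaviour of the standard mode boils down to a timestamp comparison, sketched below. The function needs_download and its parameters are hypothetical names chosen for the illustration; how the downloader actually stores and reads the on-disk timestamp is not described here.

```python
def needs_download(server_ts, local_ts, force=False):
    """Decide whether a document must be downloaded (again).

    server_ts: timestamp of the document on the server.
    local_ts:  timestamp stored with the copy on disk, or None if the
               document was never downloaded.
    force:     download regardless of the timestamps.
    """
    if force or local_ts is None:
        return True
    return server_ts > local_ts  # strictly more recent on the server


# Toy example: document id -> (server timestamp, timestamp on disk)
documents = {7749: (1500, 1500), 7750: (1600, 1500), 7751: (1400, None)}
to_fetch = [doc for doc, (srv, loc) in documents.items() if needs_download(srv, loc)]
print(to_fetch)  # [7750, 7751]
```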
> python C:\Local\meunier\git\DLA\src\read\TranskribusCommands\Transkribus_downloader.py 3571
- Done
- Downloading collection 3571 to folder .
- creating folder: .\trnskrbs_3571
INFO:root:- downloading collection 3571 into folder .\trnskrbs_3571\col (bForce=False)
INFO:root:- downloading collection 3571, document 7749 into folder .\trnskrbs_3571\col\7749 (bForce=False)
INFO:root:- DONE (downloaded collection 3571, document 7750 into folder .\trnskrbs_3571\col\7750 (bForce=False))
INFO:root:- DONE (downloaded collection 3571 into folder .\trnskrbs_3571\col (bForce=False))
- Done
- Generating multi_page PageXml
- .\trnskrbs_3571\col\7750
- .\trnskrbs_3571\col\7750.mpxml
- .\trnskrbs_3571\col\7749
- .\trnskrbs_3571\col\7749.mpxml
- Done, see in .\trnskrbs_3571
Utility to upload a full collection to Transkribus from a conventional DU folder structure created by the transkribus-downloader. This tool uploads the PageXml files (one XML file per page, i.e. the .pxml files).
In standard mode, the last transcript of each page is uploaded, thanks to the trp.json files stored by the downloader to guide the upload. SO KEEP THE FILES CONSISTENT!
In particular, the status of the uploaded transcript will be the same as its parent, as indicated in the trp.json file. The --trp option allows you to use another TRP file of your choice, typically one generated using the do_Transcript.py command.
The --set_status option allows you to force the given status to the newly uploaded transcripts. (The status must be one of the Transkribus known statuses.)
Upload the transcript(s) from the DS structure to Transkribus, either of the collection or one of its document(s).
The <directory> must have been created by transkribus_downloader.py and should contain the 'col' directory and a trp.json file for the collection, and one per document (the 'out', 'ref', 'run', 'xml' folders are not used).
The page transcript from the single page PageXml files are uploaded. (The multi-page xml file(s) are ignored)
Pass your login/password as options otherwise consider having a Transkribus_credential.py file, which defines a 'login' and a 'pwd' variables.
If you need to use a proxy, use the --https_proxy option or set the environment variables HTTPS_PROXY.
To use HTTP Basic Auth with your proxy, use the http://user:password@host/ syntax.
Options:
  --version                    show program's version number and exit
  -h, --help                   show this help message and exit
  -s SERVER, --server=SERVER   Transkribus server URL
  -l LOGIN, --login=LOGIN      Transkribus login (consider storing your credentials in 'transkribus_credentials.py')
  -p PWD, --pwd=PWD            Transkribus password
  --persist                    Try using an existing persistent session, or log-in and persists the session.
  --https_proxy=HTTPS_PROXY    proxy, e.g. http://cornillon:8000
  -q, --quiet                  Quiet mode
  --trp=TRP                    download the content specified by the trp file.
  --toolname=TOOL              Set the Toolname metadata in Transkribus.
  --message=TEXT               Set the Message metadata in Transkribus.
Utility to upload the transcripts from a MultiPageXml XML file to a Transkribus collection. The MultiPageXml is split into PageXml single-page transcripts, which are then uploaded to become a new version of the transcript for each page of the document(s).
This utility expects to work on complete documents from the folders created by Transkribus_downloader.py
Beware: do not use the --trp option at download time, because the multi-page XML upload will then lay the pages of the .mpxml file over the document's pages starting from page 1.
Usage:''' Transkribus_transcriptUploader.py [--toolname] [--message] <directory> <colid> [<docid>]'''
Analyze the layout of a page by batch method. Currently, this is a pre-requisite before doing HTR.
usage : '''do_analyseLayout.py <colid> <docid> [<pages>] <donotblockseq> <donotlineseg>'''
'''by default blocks and lines seg are performed. Indicate (0=False/1=True) which actions will be performed'''
Display the job ID
do_analyseLayout.py 1949 10462 4
35531
- Done
Analyze the layout (create lines and regions) of a doc/pages. Currently, this is a pre-requisite before doing HTR.
usage : '''do_analyseLayoutNew.py <colid> '''
  -r REGION, --region=REGION  apply Layout Analysis (textLine)
  --trp=TRP_DOC               use trp doc file
  --docid=DOCID               document/pages to be analyzed
  --doRegionSeg               do Region detection
  --batchjob                  do one job per page
Display the job ID
do_analyseLayoutNew.py 1949 --docid=10462/4
35531
- Done
Analyze the layout of a page by batch method. Currently, this is a pre-requisite before doing HTR.
usage : '''do_analyseLayoutBatch.py <colid> <docid> [<pages>] <donotblockseq> <donotlineseg>'''
'''by default blocks and lines seg are performed. Indicate (0=False/1=True) which actions will be performed'''
Display the job ID
do_analyseLayoutBatch.py 1949 10462 4
35531
- Done
Usage: do_tableTemplate.py <colid> <docid/pagerange>
do_tableTemplate.py --templateID 6163718 27593 103626
job ID:['435759']
- Done
List the HTR HMM models:
- Done
Apply an HTR HMM model onto the pages of a document: do_htrHmm.py
usage : '''do_htrHmm.py <model-name> <colid> <docid> [<pages>]'''
do_htr.py Wydemann 3829 8620 1
- Done
35313
List the names of the RNN HTR models and the names of the dictionaries. do_listHtrRnn.py
--- Models ---------------------------
20160408_htrts_midfinal_11.sprnn
meganet_hist_01_crx.sprnn
meganet_us1900_05_us_an_crx.sprnn
meganet_usaddr_12_pp_crx.sprnn
net_160201_trained_noise.sprnn
net_fraktur_0000_2000.sprnn
South_Carolina_1720.sprnn
GEO_1-3.sprnn
GEO_1-3_v2.sprnn
Reichsgericht_v4.sprnn
IO_Botany_v1.sprnn
Resolutions_v1.sprnn
Bozen.sprnn
escher_v3.sprnn
Frisch-Sklaverei.sprnn
IO_Botany_v2.sprnn
Konzilsprotokolle_v1.sprnn
hervetest.sprnn.sprnn
StAZH_v1.sprnn
Cyrillic_20th_Century.sprnn
Hyde_Reel_1_Session_2.sprnn
Gothic_Letter_1622.sprnn
NB_Norway_Koren.sprnn
Egypt_diary.sprnn
Sutor.sprnn
--- Dictionaries ---------------------
alvermann_train.dict
deutsch.dict
deutscheNachnamen.dict
eng.dict
fracture.dict
frau.dict
htrts15_all_sorted.dict
mann.dict
Bozen_v1.dict
Resolutions_v1.dict
Reichsgericht_v1.dict
StAZH_v1.dict
Cyrillic_20th_Century.dict
Gothic_Letter_1622.dict
NB_Norway_Koren.dict
- Done
$ '''New version using modelID: do_listHtrRnn.py --colid=COLID'''
$ python ../../../Local/TranskribusPyClient/src/TranskribusCommands/do_listHtrRnn.py --colid=6722
385 ABP KWS Test
#Wed Jun 07 17:57:19 CEST 2017
Learning\ Rate=2e-3
Nr.\ of\ Epochs=200
Train\ Size\ per\ Epoch=1000
Noise=both
133 English Writing M1
no params
Train an HTR (RNN) model for a given collection. The training and test sets are selected using the --trdoc and --tsdoc options.
do_htrTrainRnn.py
do_htrTrainRnn.py <MODELNAME> <COLID> --trdoc=DOCID/PAGERANGE (--trdoc=...) --tsdoc=DOCID/PAGERANGE (--tsdoc=...) --epoch=N[,N] --batch=N[,N] --lr=F[,F]
This command creates 8 jobs with the following parameters:
2e-3 200 200
2e-3 500 200
5e-3 200 200
5e-3 500 200
2e-3 200 400
2e-3 500 400
5e-3 200 400
5e-3 500 400
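The 8 jobs correspond to every combination of the comma-separated values (2 x 2 x 2). The sketch below shows such an expansion with itertools.product. The function expand_grid is hypothetical, and which of the value pairs 200/500 and 200/400 belongs to --epoch and which to --batch is a guess made for the example.

```python
from itertools import product


def expand_grid(lr, epoch, batch):
    """Return one (lr, epoch, batch) tuple per combination of values.

    Each argument is a comma-separated string, as on the command line.
    """
    lrs = [float(v) for v in lr.split(",")]
    epochs = [int(v) for v in epoch.split(",")]
    batches = [int(v) for v in batch.split(",")]
    return list(product(lrs, epochs, batches))


# Assumed assignment of the values to the options (see above).
jobs = expand_grid(lr="2e-3,5e-3", epoch="200,500", batch="200,400")
print(len(jobs))  # 8
for job in jobs:
    print(job)
```

The order in which itertools.product enumerates the combinations differs from the listing above; only the set of combinations is the same.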
Apply an HTR RNN model and dictionary onto the pages of a document: do_htrRnn.py. Update: htrRnnDecode now works at transcript and regions id level. With this command line, the last transcript of the page is taken if docid option is used. With --trp, the transcript referred in the trp file is used.
- It now uses tempDict dictionaries; I will add an option for using common dictionaries*
$ do_htrRnn.py 899 abp_family.dict 5400 --docid=17442/1-2
35442
- Done
$ do_htrRnn.py 899 abp_family.dict 5400 --trp=5400_17442_1.trp
35443
- Done
Upload a dictionary to your tempDict folder (see your ftp folder at ftp://transbribus.eu). If several files are mentioned (several -d), they are concatenated. Delimiters are replaced by the ',' [comma] delimiter. The result can be seen in your tempDict folder. The new dictionary is named <dictionary-name>.
usage: '''do_uploadDictionary.py <dictionary-name> -d <dictionary-file>'''
In order to check the status of a Transkribus job, you can use do_getRnnTrainingJobStatus.py. Note that this only allows you to query your own jobs.
usage : '''do_getRnnTrainingJobStatus.py <jobid>'''
$ do_getRnnTrainingJobStatus.py 35442
"Done"
The API includes lock-related methods (listPageLocks, lockPage(<bool>), isPageLocked). Should we lock a page we work on? Like when a DU tool does automatic annotation?
If some DU tool does automatic annotation, which status should we set? For now, we don't set it, and it becomes "in progress", which is OK.
If we generate a ML model for a collection, can we store it somewhere in Transkribus and associate it with the collection?