
Miracle

gramirez-prompsit edited this page May 31, 2021 · 2 revisions

Give miracle.py a small input corpus, monolingual or parallel, and get back a larger set of similar sentences selected from big corpora.


Usage

python3.7 miracle.py --help
usage: miracle.py [-h] [-q] [--debug] [--logfile LOGFILE] [-v] config

positional arguments:
  config             Config yaml file

optional arguments:
  -h, --help         show this help message and exit

Logging:
  -q, --quiet        Silent logging mode (default: False)
  --debug            Debug logging mode (default: False)
  --logfile LOGFILE  Store log to a file (default: <_io.TextIOWrapper
                     name='<stderr>' mode='w' encoding='UTF-8'>)
  -v, --version      Show version of this script and exit

Configuration

miracle.py reads its configuration from a YAML file with the following keys:

  • input: Input file. One sentence per line (tab-separated in case it's parallel text)
  • output: Output file.
  • sents: Number of sentences to be collected for the output file.
  • collection: Solr collection to be queried for sentences.
  • lang: First language in the input file; also, first language in the Solr collection.
  • side: Side (src or trg) in which lang is located in collection.
  • lang2: Second language in the input file (if available); also, second language in the Solr collection.
  • side2: Side (src or trg) in which lang2 is located in collection.
  • isparallel: True if input is parallel, False otherwise.
  • outformat: Output file format: tsv (tab-separated values) or tmx (Translation Memory eXchange format).
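The keys above can be sanity-checked before launching a run. The following is a minimal sketch, not part of miracle.py: the `check_config` helper is hypothetical, and it assumes that every key listed above is required, which the script itself may not enforce.

```python
# Hypothetical pre-flight check for a miracle.py config, given as a dict
# (for instance, the result of parsing the YAML file).
# Assumption: all ten keys are mandatory; miracle.py may be more lenient.
REQUIRED_KEYS = {
    "input", "output", "sents", "collection", "lang",
    "side", "lang2", "side2", "isparallel", "outformat",
}


def check_config(config):
    """Return a list of human-readable problems (empty if none found)."""
    problems = []
    missing = REQUIRED_KEYS - set(config)
    if missing:
        problems.append("missing keys: " + ", ".join(sorted(missing)))
    for key in ("side", "side2"):
        if key in config and config[key] not in ("src", "trg"):
            problems.append(key + " must be src or trg")
    if "outformat" in config and config["outformat"] not in ("tsv", "tmx"):
        problems.append("outformat must be tsv or tmx")
    if "isparallel" in config and not isinstance(config["isparallel"], bool):
        problems.append("isparallel must be True or False")
    return problems
```

An empty list means the configuration at least has the expected shape; it says nothing about whether the Solr collection exists.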

For example, climbing.config:

input: climbing.txt
output: climbing.out
sents: 10000
collection: paracrawl-en-de
lang: en
side: src
lang2: de
side2: trg
isparallel: False
outformat: tsv

Running python3.7 miracle.py climbing.config will build a tab-separated corpus of 10,000 en-de sentences from the ParaCrawl EN-DE corpus, based on the sentences in climbing.txt (monolingual English text), and store it in climbing.out.

Another example:

input: patents.en-es
output: patents.out
sents: 100000
collection: paracrawl-en-es
lang: en
side: src
lang2: es
side2: trg
outformat: tmx
isparallel: True

This will build a TMX file containing 100,000 sentences from the ParaCrawl EN-ES corpus, based on the bilingual sample corpus patents.en-es.

One last example:

input: godspeed.txt
output: godspeed.out
sents: 10000
collection: paracrawl-en-es
lang: es
side: trg
lang2: en
side2: src
isparallel: False
outformat: tsv

This configuration makes miracle.py build a tab-separated file of 10,000 sentences from the ParaCrawl EN-ES corpus, based on the sample text godspeed.txt, which contains monolingual Spanish text.

Algorithm overview

ngrammer

The first step is the extraction of all possible n-grams (orders 8 down to 1, using nltk) from the input text. A candidate n-gram is rejected if it contains punctuation or if all of its tokens are stopwords. The n-grams are then sorted by frequency and assigned a score: the product of the order of the n-gram and its frequency of occurrence in the input text. A higher score means that the n-gram is more relevant.
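The extraction and scoring step can be sketched as follows. This is an illustration under stated assumptions, not the code in miracle.py: it swaps nltk for a plain regular-expression tokenizer so that it runs without extra dependencies, and the stopword list and function names are hypothetical.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list (the real tool uses nltk).
STOPWORDS = {"a", "an", "the", "to", "it", "and", "of", "in", "is"}


def tokenize(sentence):
    # Lowercase, then split into word tokens and single punctuation marks.
    return re.findall(r"\w+|[^\w\s]", sentence.lower())


def is_punctuation(token):
    # Anything that is not made only of word characters counts as punctuation.
    return re.fullmatch(r"\w+", token) is None


def score_ngrams(sentences, max_order=8):
    """Return (ngram, score) pairs sorted by score, highest first."""
    counts = Counter()
    for sentence in sentences:
        tokens = tokenize(sentence)
        for order in range(max_order, 0, -1):
            for start in range(len(tokens) - order + 1):
                ngram = tuple(tokens[start:start + order])
                if any(is_punctuation(tok) for tok in ngram):
                    continue  # rejected: contains punctuation
                if all(tok in STOPWORDS for tok in ngram):
                    continue  # rejected: made only of stopwords
                counts[ngram] += 1
    # score = order of the n-gram * frequency in the input text
    scores = {ngram: len(ngram) * freq for ngram, freq in counts.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

With the two inputs "knit the next stitch" and "knit the next row.", the trigram ("knit", "the", "next") occurs twice and scores 3 x 2 = 6, while ("the",) is dropped as a pure-stopword n-gram.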

Grouping

Once all n-grams are extracted and sorted by score, they are split into four groups, each containing 25% of the scores (not of the n-grams). The first three groups (i.e. the top 75% of scores) form the "regular group". The fourth group is called the "emergency group", because it is only used if the regular group has been fully queried (more on this below).

If the input is parallel, separate groups are built for source and target.

After grouping, scores that contain more than 200 n-grams are split into batches of 200 n-grams, in order to stay within Solr query limits.
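A minimal sketch of the grouping and batching described above. The helper names are hypothetical, and the rounding used when the number of distinct scores is not a multiple of four (floor of 75%) is an assumption, since the text does not specify it.

```python
from collections import defaultdict


def group_by_score(scored):
    """scored maps n-gram -> score. Returns [(score, [ngrams])], best first."""
    by_score = defaultdict(list)
    for ngram, score in scored.items():
        by_score[score].append(ngram)
    return sorted(by_score.items(), key=lambda item: item[0], reverse=True)


def split_regular_emergency(ordered):
    # Assumption: the regular group is the first floor(75%) of distinct scores.
    cut = (len(ordered) * 3) // 4
    return ordered[:cut], ordered[cut:]


def batches(ngrams, size=200):
    # A score holding more than `size` n-grams is queried in several batches.
    return [ngrams[i:i + size] for i in range(0, len(ngrams), size)]
```

Note that the split is made over score values, so a single score shared by thousands of n-grams still counts as one scoring group, only broken into several batches.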

Querying Solr

Given that we want to retrieve a total of sents sentences, we distribute that amount among the scoring groups of the "regular group" by dividing sents by the number of scoring groups. This value is the base "sentences per page" (spg). When a group has several batches, each batch gets an equal portion of the spg.

Once the base spg is obtained, a booster is applied: the top 25% of groups get a x4 booster (meaning that Solr will be queried for 4 x spg sentences from these groups), the 25%-50% groups get a x2 booster, and the 50%-75% groups get no booster. The emergency group, if needed, gets no booster either. Thus, in the best case, querying the top 25% with the x4 booster is enough to reach the desired total number of sentences.
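The arithmetic can be sketched as below. Two points are assumptions rather than facts about miracle.py: the divisor is taken to be the total number of distinct scores (which makes the top 25% alone add up to sents, matching the best case described above), and fractional values are rounded up. The function names are hypothetical.

```python
import math


def booster_for(rank, n_scores):
    """rank: 0-based position of a score in descending order."""
    quartile = (rank * 4) // n_scores  # 0 = top 25%, 3 = emergency group
    return {0: 4, 1: 2}.get(quartile, 1)


def rows_per_query(sents, n_scores, rank, n_batches=1):
    # Assumption: base spg = sents / total number of distinct scores.
    spg = math.ceil(sents / n_scores)
    boosted = spg * booster_for(rank, n_scores)
    # Each batch of a large group gets an equal share of the boosted spg.
    return math.ceil(boosted / n_batches)
```

For sents = 80 and 8 distinct scores, spg is 10: the two best scores are queried for 40 sentences each (80 in total), the next two for 20 each, and the rest for 10 each.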

In short: for each scoring group (or batch, if the group is too large), a single query is sent to Solr, with the n-grams in the group joined by OR clauses, asking for spg x booster sentences.
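As an illustration of the OR-joined query, here is a small sketch. The field name `src` and the `build_query` helper are assumptions made for this example; the actual field names depend on the schema of the Solr collection.

```python
def build_query(ngrams, field="src"):
    """Join the n-grams of one group (or batch) into a single OR query.

    Each n-gram is quoted so that it is searched as a phrase. No escaping
    is done here: n-grams containing punctuation were rejected earlier.
    """
    clauses = ['{}:"{}"'.format(field, " ".join(ngram)) for ngram in ngrams]
    return " OR ".join(clauses)
```

For example, `build_query([("twined", "knitting"), ("knit",)])` yields `src:"twined knitting" OR src:"knit"`.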

If the number of results retrieved for a given group is below spg x booster, there are no more results for this group (or batch): it is marked as "done" and is not queried in later rounds.

Once all groups have been queried in a first round, we use Solr's pagination to request the next page of results, repeating until all groups are marked as done or the desired number of sentences is obtained.

In order to avoid duplicates, the identifier of each Solr entry is kept in a set, and sentences that have already been added to the output are discarded.
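The round-based querying, "done" marking, pagination and deduplication can be sketched as one loop. This is a simplified illustration, not the script's code: `search(query, start, rows)` is a hypothetical callable standing in for the Solr request (`start` and `rows` mirror Solr's paging parameters), and `fake_search` with its in-memory `INDEX` exists only to make the example runnable.

```python
def collect(groups, sents, search):
    """groups: list of {'query': str, 'rows': int}, best group first.

    search(query, start, rows) must return a list of (doc_id, sentence).
    """
    seen = set()                  # identifiers already in the output
    output = []
    done = [False] * len(groups)
    page = 0
    while len(output) < sents and not all(done):
        for i, group in enumerate(groups):
            if done[i]:
                continue
            rows = group["rows"]
            results = search(group["query"], page * rows, rows)
            if len(results) < rows:
                done[i] = True    # fewer results than requested: exhausted
            for doc_id, sentence in results:
                if doc_id in seen:
                    continue      # duplicate, already in the output
                seen.add(doc_id)
                output.append(sentence)
                if len(output) == sents:
                    return output
        page += 1                 # next round asks for the next page
    return output


# Stand-in for Solr, for illustration only.
INDEX = {
    "q1": [(1, "a"), (2, "b"), (3, "c")],
    "q2": [(2, "b"), (4, "d")],
}


def fake_search(query, start, rows):
    return INDEX[query][start:start + rows]
```

With two groups asking for 2 rows each, the first round yields "a", "b" and "d" (entry 2 is returned by both queries but kept once), and the second round adds "c" and marks both groups as done.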

If the regular group has been fully queried and the desired number of output sentences has not been reached, the same querying process is repeated for the emergency group.

If the input is parallel, half of the total number of results is retrieved for the source and the other half for the target.

Similarity measure

In order to get a sense of how similar the output and input sentences are, a "similarity measure" is provided.

We first calculate the best case, in which all the retrieved sentences come from the top-25% group, with a x4 booster. We multiply the score of each group in the top 25% by 4 x spg and sum the results. This is the theoretical max. similarity.

Then, when querying Solr, for each retrieved sentence we accumulate the score of the group it came from. This value is the accumulated similarity.

Finally, we return the similarity: the accumulated similarity in relation to the max. similarity. For example, if the max. similarity is 1000 and the accumulated similarity is 800, the similarity is 80%.
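The measure can be written down in a few lines. This is a sketch that follows the description above; the function names are hypothetical, and returning 0 when the maximum is zero is an assumption added to avoid a division by zero.

```python
def max_similarity(top_scores, spg):
    # Best case: every top-25% score contributes 4 x spg sentences.
    return sum(score * 4 * spg for score in top_scores)


def similarity(retrieved_scores, top_scores, spg):
    """retrieved_scores: one entry per output sentence, holding the score
    of the group that sentence came from. Returns a percentage."""
    maximum = max_similarity(top_scores, spg)
    if maximum == 0:
        return 0.0  # assumption: nothing to compare against
    return 100.0 * sum(retrieved_scores) / maximum
```

For instance, with top scores 10 and 8 and spg = 5, the maximum is 10 x 20 + 8 x 20 = 360; retrieving 20 sentences from the score-10 group and 11 from the score-8 group accumulates 288, i.e. 80%.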

Example

Config file

input: tiny-test.txt
output: tiny-test.out
sents: 15
collection: paracrawl-en-es
lang: en
side: src
lang2: es
side2: trg
isparallel: False
outformat: tsv

Input file

Twined knitting is an traditional Scandinavian knitting technique dating back at least to the 17th century in Sweden.
Some beautiful, subtle patterns can be made in twined knitting using purled stitches -either purling alternately with both strands, or knitting with one strand and purling the other.
To prepare the yarn for knitting first be sure it's a center pull ball, then find both ends.
and use it to knit the next stitch...
Pull the two strands out a good length before beginning to knit.
Then knit one stitch as you normally would...

Command run

python3.7 miracle.py tiny-test.config 

Output file

You can also download our knitting pattern below the text for free.     También puede descargar nuestro patrón de tejido debajo del texto de forma gratuita.
Many people say that this type of technique combines crochet and knitting.     Muchas personas dicen que este tipo de técnica combina ganchillo y tejido.
As knitting substance it apply in the simple and mixed clay solutions.  En calidad de la sustancia que teje la aplican en las soluciones simples y mezcladas de barro.
Knitting is a very exciting and at the same time useful activity.       Tejer es una actividad muy emocionante y al mismo tiempo útil.
The four different knitting patterns make your baby blanket an absolute gem.    Los cuatro patrones de tejido diferentes hacen que la manta de tu bebé sea una joya absoluta.
As a result production is less standardized, requiring different types of knitting machines and a greater number of workers employed.   Como consecuencia de ello, la producción está menos uniformizada y requiere distintos tipos de máquinas de tejer y un número mayor de trabajadores.
Cells called fibroblasts are responsible for knitting together wounds of the skin.     Las células llamadas los fibroblastos son responsables de hacer punto juntas hieren de la piel.
The most obvious choice for someone to explore alongside crochet is knitting. La opción más obvia para alguien explorar junto con ganchillo es tejer.
Plants absorb light energy from the sun and use it to produce glucose.  Las plantas absorben la energía luminosa del sol y la usan para producir glucosa.
RNA can be made in the laboratory and used in research studies. El ARN se puede producir en el laboratorio y se usa en estudios de investigación.
Before applying, research the company to make sure it's a good fit.     Antes de postularse, investiguen a la compañía para asegurarse de que sea conveniente.
When a problem is faced decisions regarding effective action can be made.       Cuando se enfrenta un problema, se pueden tomar decisiones respecto de una acción efectiva.
Absolutely gorgeous and it's a perfect meal for a health conscious family.      Absolutamente magnífico y es una comida perfecta para una familia consciente de la salud.
It's a good idea to double-check information you read on commercial websites.   Es una buena idea verificar la información que usted lee en sitios Web comerciales.
The cold winter weather is coming, an ideal time for knitting socks.    No Comments Llega el frío invierno, momento ideal para tejer calcetines.

About the name

Have your sentences multiplied: miracle.py is like the loaves and fish miracle -but with text!

Jesus asked them, “How many loaves have you?” They said, “Seven, and a few small fish.” Then ordering the crowd to sit down on the ground, he took the seven loaves and the fish; and after giving thanks he broke them and gave them to the disciples, and the disciples gave them to the crowds. And all of them ate and were filled; and they took up the broken pieces left over, seven baskets full.

Matthew 15:32-39
