More documentation can be found in the doc/ directory or at http://www.recoll.org
Recoll user manual
Jean-Francois Dockes
Copyright (c) 2005-2015 Jean-Francois Dockes
Permission is granted to copy, distribute and/or modify this document
under the terms of the GNU Free Documentation License, Version 1.3 or any
later version published by the Free Software Foundation; with no Invariant
Sections, no Front-Cover Texts, and no Back-Cover Texts. A copy of the
license can be found at the following location: GNU web site.
This document introduces full text search notions and describes the
installation and use of the Recoll application. This version describes
Recoll 1.21.
----------------------------------------------------------------------
Table of Contents
1. Introduction
1.1. Giving it a try
1.2. Full text search
1.3. Recoll overview
2. Indexing
2.1. Introduction
2.1.1. Indexing modes
2.1.2. Configurations, multiple indexes
2.1.3. Document types
2.1.4. Indexing failures
2.1.5. Recovery
2.2. Index storage
2.2.1. Xapian index formats
2.2.2. Security aspects
2.3. Index configuration
2.3.1. Multiple indexes
2.3.2. Index case and diacritics sensitivity
2.3.3. The index configuration GUI
2.4. Indexing WEB pages you visit
2.5. Extended attributes data
2.6. Importing external tags
2.7. Periodic indexing
2.7.1. Running indexing
2.7.2. Using cron to automate indexing
2.8. Real time indexing
2.8.1. Slowing down the reindexing rate for fast
changing files
3. Searching
3.1. Searching with the Qt graphical user interface
3.1.1. Simple search
3.1.2. The default result list
3.1.3. The result table
3.1.4. Running arbitrary commands on result
files (1.20 and later)
3.1.5. Displaying thumbnails
3.1.6. The preview window
3.1.7. The Query Fragments window
3.1.8. Complex/advanced search
3.1.9. The term explorer tool
3.1.10. Multiple indexes
3.1.11. Document history
3.1.12. Sorting search results and collapsing
duplicates
3.1.13. Search tips, shortcuts
3.1.14. Saving and restoring queries (1.21 and
later)
3.1.15. Customizing the search interface
3.2. Searching with the KDE KIO slave
3.2.1. What's this
3.2.2. Searchable documents
3.3. Searching on the command line
3.4. Path translations
3.5. The query language
3.5.1. Modifiers
3.6. Search case and diacritics sensitivity
3.7. Anchored searches and wildcards
3.7.1. More about wildcards
3.7.2. Anchored searches
3.8. Desktop integration
3.8.1. Hotkeying recoll
3.8.2. The KDE Kicker Recoll applet
4. Programming interface
4.1. Writing a document input handler
4.1.1. Simple input handlers
4.1.2. "Multiple" handlers
4.1.3. Telling Recoll about the handler
4.1.4. Input handler HTML output
4.1.5. Page numbers
4.2. Field data processing
4.3. API
4.3.1. Interface elements
4.3.2. Python interface
5. Installation and configuration
5.1. Installing a binary copy
5.2. Supporting packages
5.3. Building from source
5.3.1. Prerequisites
5.3.2. Building
5.3.3. Installation
5.4. Configuration overview
5.4.1. Environment variables
5.4.2. The main configuration file, recoll.conf
5.4.3. The fields file
5.4.4. The mimemap file
5.4.5. The mimeconf file
5.4.6. The mimeview file
5.4.7. The ptrans file
5.4.8. Examples of configuration adjustments
Chapter 1. Introduction
1.1. Giving it a try
If you do not like reading manuals (who does?) but wish to give Recoll a
try, just install the application and start the recoll graphical user
interface (GUI), which will ask permission to index your home directory by
default, allowing you to search immediately after indexing completes.
Do not do this if your home directory contains a huge number of documents
and you do not want to wait or are very short on disk space. In this case,
you may first want to customize the configuration to restrict the indexed
area (for the very impatient with a completed package install, from the
recoll GUI: Preferences -> Indexing configuration, then adjust the Top
directories section).
Also be aware that you may need to install the appropriate supporting
applications for document types that need them (for example antiword for
Microsoft Word files).
1.2. Full text search
Recoll is a full text search application. Full text search finds your data
by content rather than by external attributes (like a file name). You
specify words (terms) which should or should not appear in the text you
are looking for, and receive in return a list of matching documents,
ordered so that the most relevant documents will appear first.
You do not need to remember in what file or email message you stored a
given piece of information. You just ask for related terms, and the tool
will return a list of documents where these terms are prominent, in a
similar way to Internet search engines.
Full text search applications try to determine which documents are most
relevant to the search terms you provide. Computer algorithms for
determining relevance can be very complex, and in general are inferior to
the power of the human mind to rapidly determine relevance. The quality of
relevance guessing is probably the most important aspect when evaluating a
search application.
In many cases, you are looking for all the forms of a word, including
plurals, different tenses for a verb, or terms derived from the same root
or stem (example: floor, floors, floored, flooring...). Queries are
usually automatically expanded to all such related terms (words that
reduce to the same stem). This can be prevented for searching for a
specific form.
Stemming, by itself, does not accommodate misspellings or phonetic
searches. A full text search application may also support this form of
approximation. For example, a search for aliterattion returning no result
may propose, depending on index contents, alliteration alteration
alterations altercation as possible replacement terms.
1.3. Recoll overview
Recoll uses the Xapian information retrieval library as its storage and
retrieval engine. Xapian is a very mature package using a sophisticated
probabilistic ranking model.
The Xapian library manages an index database which describes where terms
appear in your document files. It efficiently processes the complex
queries which are produced by the Recoll query expansion mechanism, and is
in charge of the all-important relevance computation task.
Recoll provides the mechanisms and interface to get data into and out of
the index. This includes translating the many possible document formats
into pure text, handling term variations (using Xapian stemmers), and
spelling approximations (using the aspell speller), interpreting user
queries and presenting results.
In short, Recoll does the dirty footwork, while Xapian deals with the
intelligent parts of the process.
The Xapian index can be big (roughly the size of the original document
set), but it is not a document archive. Recoll can only display documents
that still exist at the place from which they were indexed. (Actually,
there is a way to reconstruct a document from the information in the
index, but the result is not nice, as all formatting, punctuation and
capitalization are lost).
Recoll stores all internal data in Unicode UTF-8 format, and it can index
files of many types with different character sets, encodings, and
languages into the same index. It can process documents embedded inside
other documents (for example a pdf document stored inside a Zip archive
sent as an email attachment...), down to an arbitrary depth.
Stemming is the process by which Recoll reduces words to their radicals so
that searching does not depend, for example, on a word being singular or
plural (floor, floors), or on a verb tense (flooring, floored). Because
the mechanisms used for stemming depend on the specific grammatical rules
for each language, there is a separate Xapian stemmer module for most
common languages where stemming makes sense.
Recoll stores the unstemmed versions of terms in the main index and uses
auxiliary databases for term expansion (one for each stemming language),
which means that you can switch stemming languages between searches, or
add a language without needing a full reindex.
Storing documents written in different languages in the same index is
possible, and commonly done. In this situation, you can specify several
stemming languages for the index.
Recoll currently makes no attempt at automatic language recognition, which
means that the stemmer will sometimes be applied to terms from other
languages with potentially strange results. In practice, even if this
introduces possibilities of confusion, this approach has been proven quite
useful, and it is much less cumbersome than separating your documents
according to what language they are written in.
Before version 1.18, Recoll stripped most accents and diacritics from
terms, and converted them to lower case before either storing them in the
index or searching for them. As a consequence, it was impossible to search
for a particular capitalization of a term (US / us), or to discriminate
two terms based on diacritics (sake / saké, mate / maté).
As of version 1.18, Recoll can optionally store the raw terms, without
accent stripping or case conversion. In this configuration, it is still
possible (and most common) for a query to be insensitive to case and/or
diacritics. Appropriate term expansions are performed before actually
accessing the main index. This is described in more detail in the section
about index case and diacritics sensitivity.
Recoll has many parameters which define exactly what to index, and how to
classify and decode the source documents. These are kept in configuration
files. A default configuration is copied into a standard location (usually
something like /usr/[local/]share/recoll/examples) during installation.
The default values set by the configuration files in this directory may be
overridden by values that you set inside your personal configuration,
found by default in the .recoll sub-directory of your home directory. The
default configuration will index your home directory with default
parameters and should be sufficient for giving Recoll a try, but you may
want to adjust it later, which can be done either by editing the text
files or by using configuration menus in the recoll GUI. Some other
parameters affecting only the recoll GUI are stored in the standard
location defined by Qt.
The indexing process is started automatically the first time you execute
the recoll GUI. Indexing can also be performed by executing the
recollindex command. Recoll indexing is multithreaded by default when
appropriate hardware resources are available, and can perform in parallel
multiple tasks among text extraction, segmentation and index updates.
Searches are usually performed inside the recoll GUI, which has many
options to help you find what you are looking for. However, there are
other ways to perform Recoll searches: mostly a command line interface, a
Python programming interface, a KDE KIO slave module, and Ubuntu Unity
Lens (for older versions) or Scope (for current versions) modules.
Chapter 2. Indexing
2.1. Introduction
Indexing is the process by which the set of documents is analyzed and the
data entered into the database. Recoll indexing is normally incremental:
documents will only be processed if they have been modified since the last
run. On the first execution, all documents will need processing. A full
index build can be forced later by specifying an option to the indexing
command (recollindex -z or -Z).
recollindex skips files which caused an error during a previous pass. This
is a performance optimization, and a new behaviour in version 1.21 (failed
files were always retried by previous versions). The command line option
-k can be set to retry failed files, for example after updating a filter.
The following sections give an overview of different aspects of the
indexing processes and configuration, with links to detailed sections.
Depending on your data, temporary files may be needed during indexing,
some of them possibly quite big. You can use the RECOLL_TMPDIR or TMPDIR
environment variables to determine where they are created (the default is
to use /tmp). Using TMPDIR has the nice property that it may also be taken
into account by auxiliary commands executed by recollindex.
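As a minimal sketch of the above, the following shell fragment points both
variables at a roomier scratch directory before an indexing run. The
directory name is a hypothetical choice, and the indexer is only started
if it is found in the PATH.

```shell
# Hypothetical scratch directory; pick any location with enough free space.
scratch="${HOME:-/tmp}/recoll-scratch"
mkdir -p "$scratch"

# RECOLL_TMPDIR is read by Recoll itself; TMPDIR may also be taken into
# account by the auxiliary commands that recollindex executes.
export RECOLL_TMPDIR="$scratch"
export TMPDIR="$scratch"

# Start the indexer from this same shell so that it inherits both variables.
if command -v recollindex >/dev/null 2>&1; then
    recollindex
else
    echo "recollindex not found in PATH" >&2
fi
```

Setting only RECOLL_TMPDIR should be enough when the helpers' own
temporary files are small.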
2.1.1. Indexing modes
Recoll indexing can be performed in two different modes:
o Periodic (or batch) indexing: indexing takes place at discrete times,
by executing the recollindex command. The typical usage is to have a
nightly indexing run programmed into your cron file.
o Real time indexing: indexing takes place as soon as a file is created
or changed. recollindex runs as a daemon and uses a file system
alteration monitor such as inotify, Fam or Gamin to detect file
changes.
The choice between the two methods is mostly a matter of preference, and
they can be combined by setting up multiple indexes (for example: use periodic
indexing on a big documentation directory, and real time indexing on a
small home directory). Monitoring a big file system tree can consume
significant system resources.
The choice of method and the parameters used can be configured from the
recoll GUI: Preferences -> Indexing schedule
2.1.2. Configurations, multiple indexes
The parameters describing what is to be indexed and local preferences are
defined in text files contained in a configuration directory.
All parameters have defaults, defined in system-wide files.
Without further configuration, Recoll will index all appropriate files
from your home directory, with a reasonable set of defaults.
A default personal configuration directory ($HOME/.recoll/) is created
when a Recoll program is first executed. It is possible to create other
configuration directories, and use them by setting the RECOLL_CONFDIR
environment variable, or giving the -c option to any of the Recoll
commands.
In some cases, it may be interesting to index different areas of the file
system to separate databases. You can do this by using multiple
configuration directories, each indexing a file system area to a specific
database. Typically, this would be done to separate personal and shared
indexes, or to take advantage of the organization of your data to improve
search precision.
The generated indexes can be queried concurrently in a transparent manner.
For index generation, multiple configurations are totally independent from
each other. When multiple indexes need to be used for a single search,
some parameters should be consistent among the configurations.
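The following is a sketch of such a setup, with one configuration
directory per file system area. The directory names and the topdirs values
are hypothetical, and the configurations are created under a scratch
directory so that an existing ~/.recoll is left alone.

```shell
# Scratch parent directory, so that the sketch does not touch ~/.recoll.
base=$(mktemp -d)
personal="$base/recoll-personal"
shared="$base/recoll-shared"
mkdir -p "$personal" "$shared"

# Each configuration names the area it covers with the topdirs variable.
# Both paths are hypothetical examples.
printf 'topdirs = %s\n' '~/Documents' > "$personal/recoll.conf"
printf 'topdirs = %s\n' '/srv/shared-docs' > "$shared/recoll.conf"

# Update each index separately, selecting the configuration with -c.
for conf in "$personal" "$shared"; do
    if command -v recollindex >/dev/null 2>&1; then
        recollindex -c "$conf"
    else
        echo "would run: recollindex -c $conf"
    fi
done
```

Setting RECOLL_CONFDIR to one of the two directories instead of passing -c
should select the same configuration.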
2.1.3. Document types
Recoll knows about quite a few different document types. The parameters
for document types recognition and processing are set in configuration
files.
Most file types, like HTML or word processing files, only hold one
document. Some file types, like email folders or zip archives, can hold
many individually indexed documents, which may themselves be compound
ones. Such hierarchies can go quite deep, and Recoll can process, for
example, a LibreOffice document stored as an attachment to an email
message inside an email folder archived in a zip file...
Recoll indexing processes plain text, HTML, OpenDocument
(Open/LibreOffice), email formats, and a few others internally.
Other file types (for example: postscript, pdf, ms-word, rtf ...) need external
applications for preprocessing. The list is in the installation section.
After every indexing operation, Recoll updates a list of commands that
would be needed for indexing existing file types. This list can be
displayed by selecting the menu option File -> Show Missing Helpers in the
recoll GUI. It is stored in the missing text file inside the configuration
directory.
By default, Recoll will try to index any file type that it has a way to
read. This is sometimes not desirable, and there are ways to either
exclude some types, or on the contrary to define a positive list of types
to be indexed. In the latter case, any type not in the list will be
ignored.
Excluding types can be done by adding wildcard name patterns to the
skippedNames list, which can be done from the GUI Index configuration
menu. For versions 1.20 and later, you can alternatively set the
excludedmimetypes list in the configuration file. This can be redefined
for subdirectories.
You can also define an exclusive list of MIME types to be indexed (no
others will be indexed), by setting the indexedmimetypes configuration
variable. Example:
indexedmimetypes = text/html application/pdf
It is possible to redefine this parameter for subdirectories. Example:
[/path/to/my/dir]
indexedmimetypes = application/pdf
(When using sections like this, don't forget that they remain in effect
until the end of the file or another section indicator).
excludedmimetypes or indexedmimetypes can be set either by editing the
main configuration file (recoll.conf), or from the GUI index configuration
tool.
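As an illustration of combining the two variables, the fragment below
writes a small recoll.conf into a scratch directory. The MIME types and the
subdirectory path are hypothetical examples, and excludedmimetypes assumes
Recoll 1.20 or later.

```shell
# Scratch directory standing in for a real configuration directory.
conf=$(mktemp -d)

cat > "$conf/recoll.conf" <<'EOF'
# Never index these types, wherever they are found (example values).
excludedmimetypes = image/jpeg audio/mpeg

# Hypothetical subtree where only PDF files are of interest.
# This section stays in effect until the end of the file.
[/path/to/my/dir]
indexedmimetypes = application/pdf
EOF

cat "$conf/recoll.conf"
```

Global settings are placed before the first section header, because
anything written after it only applies to that subdirectory.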
2.1.4. Indexing failures
Indexing may fail for some documents, for a number of reasons: a helper
program may be missing, the document may be corrupt, we may fail to
uncompress a file because no file system space is available, etc.
Recoll versions prior to 1.21 always retried indexing files which had
previously caused an error. This guaranteed that anything that may have
become indexable (for example because a helper had been installed) would
be indexed. However this was bad for performance because some indexing
failures may be quite costly (for example failing to uncompress a big file
because of insufficient disk space).
The indexer in Recoll versions 1.21 and later does not retry failed files by
default. Retrying will only occur if an explicit option (-k) is set on the
recollindex command line, or if a script executed when recollindex starts
up says so. The script is defined by a configuration variable
(checkneedretryindexscript), and makes a rather lame attempt at deciding
if a helper command may have been installed, by checking if any of the
common bin directories have changed.
2.1.5. Recovery
In the rare case where the index becomes corrupted (which can signal
itself by weird search results or crashes), the index files need to be
erased before restarting a clean indexing pass. Just delete the xapiandb
directory (see next section), or, alternatively, start the next
recollindex with the -z option, which will reset the database before
indexing.
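The manual recovery procedure can be sketched as a small shell function.
The function name is hypothetical, and the sketch assumes that the index
lives in the default xapiandb subdirectory of the configuration directory
(that is, dbdir is not set in recoll.conf). It is demonstrated here on a
scratch directory.

```shell
# reset_index CONFDIR: erase the index data under CONFDIR, then rebuild it.
reset_index() {
    confdir=$1
    # Only derived data lives here; it can be rebuilt from the documents.
    rm -rf "$confdir/xapiandb"
    if command -v recollindex >/dev/null 2>&1; then
        recollindex -c "$confdir"
    else
        echo "recollindex not found; index removed but not rebuilt" >&2
    fi
}

# Demonstration on a scratch configuration directory.
demo=$(mktemp -d)
mkdir -p "$demo/xapiandb"
reset_index "$demo"
```

For a real index, the argument would be the configuration directory in
use, typically $HOME/.recoll. Running recollindex -z achieves the reset
without the manual deletion.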
2.2. Index storage
The default location for the index data is the xapiandb subdirectory of
the Recoll configuration directory, typically $HOME/.recoll/xapiandb/.
This can be changed via two different methods (with different purposes):
o You can specify a different configuration directory by setting the
RECOLL_CONFDIR environment variable, or using the -c option to the
Recoll commands. This method would typically be used to index
different areas of the file system to different indexes. For example,
if you were to issue the following commands:
export RECOLL_CONFDIR=~/.indexes-email
recoll
Then Recoll would use configuration files stored in ~/.indexes-email/
and, (unless specified otherwise in recoll.conf) would look for the
index in ~/.indexes-email/xapiandb/.
Using multiple configuration directories and configuration options
allows you to tailor multiple configurations and indexes to handle
whatever subset of the available data you wish to make searchable.
o For a given configuration directory, you can specify a non-default
storage location for the index by setting the dbdir parameter in the
configuration file (see the configuration section). This method would
mainly be of use if you wanted to keep the configuration directory in
its default location, but desired another location for the index,
typically out of disk occupation concerns.
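A sketch of the second method follows. Both directories are scratch
stand-ins (for ~/.recoll and for a volume with more free space), and the
recoll-xapiandb directory name is a hypothetical choice.

```shell
conf=$(mktemp -d)       # stand-in for the configuration directory
bigdisk=$(mktemp -d)    # stand-in for a volume with more room

# The configuration stays where it is; only the index data moves.
cat > "$conf/recoll.conf" <<EOF
dbdir = $bigdisk/recoll-xapiandb
EOF

cat "$conf/recoll.conf"
```

After such a change the index has to be built again at the new location,
and the old xapiandb directory can be deleted to recover the space.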
The size of the index is determined by the size of the set of documents,
but the ratio can vary a lot. For a typical mixed set of documents, the
index size will often be close to the data set size. In specific cases (a
set of compressed mbox files for example), the index can become much
bigger than the documents. It may also be much smaller if the documents
contain a lot of images or other non-indexed data (an extreme example
being a set of mp3 files where only the tags would be indexed).
Of course, images, sound and video do not increase the index size, which
means that nowadays (2012), typically, even a big index will be negligible
against the total amount of data on the computer.
The index data directory (xapiandb) only contains data that can be
completely rebuilt by an index run (as long as the original documents
exist), and it can always be destroyed safely.
2.2.1. Xapian index formats
Xapian versions usually support several formats for index storage. A given
major Xapian version will have a current format, used to create new
indexes, and will also support the format from the previous major version.
Xapian will not automatically convert an existing index from the older
format to the newer one. If you want to upgrade to the new format, or if a
very old index needs to be converted because its format is not supported
any more, you will have to explicitly delete the old index, then run a
normal indexing process.
Using the -z option to recollindex is not sufficient to change the format;
you will have to delete all files inside the index directory (typically
~/.recoll/xapiandb) before starting the indexing.
2.2.2. Security aspects
The Recoll index does not hold copies of the indexed documents. But it
does hold enough data to allow for an almost complete reconstruction. If
confidential data is indexed, access to the database directory should be
restricted.
Recoll (since version 1.4) will create the configuration directory with a
mode of 0700 (access by owner only). As the index data directory is by
default a sub-directory of the configuration directory, this should result
in appropriate protection.
If you use another setup, you should think of the kind of protection you
need for your index, set the directory and files access modes
appropriately, and also maybe adjust the umask used during index updates.
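For an index kept outside the configuration directory, the protection can
be set up by hand, for example as sketched below. The index location is a
hypothetical scratch path.

```shell
# Hypothetical non-default index location (for example one set with dbdir).
index_dir=$(mktemp -d)/xapiandb

# Files created from now on in this shell are private to the owner.
umask 077
mkdir -p "$index_dir"

# Also remove group and other access from anything that already exists.
chmod -R go-rwx "$index_dir"

# The mode column should read drwx------.
ls -ld "$index_dir"
```

The umask only affects the current shell and its children, so the indexer
has to be started from the same shell (or from a script that sets the
umask first) for new index files to inherit the restriction.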
2.3. Index configuration
Variables set inside the Recoll configuration files control which areas of
the file system are indexed, and how files are processed. These variables
can be set either by editing the text files or by using the dialogs in the
recoll GUI.
The first time you start recoll, you will be asked whether or not you
would like it to build the index. If you want to adjust the configuration
before indexing, just click Cancel at this point, which will get you into
the configuration interface. If you exit at this point, recoll will have
created a ~/.recoll directory containing empty configuration files, which
you can edit by hand.
The configuration is documented inside the installation chapter of this
document, or in the recoll.conf(5) man page, but the most current
information will most likely be the comments inside the sample file. The
most immediately useful variable you may be interested in is probably
topdirs, which determines what subtrees get indexed.
The applications needed to index file types other than text, HTML or email
(for example: pdf, postscript, ms-word...) are described in the external packages
section.
As of Recoll 1.18 there are two incompatible types of Recoll indexes,
depending on the treatment of character case and diacritics. The next
section describes the two types in more detail.
2.3.1. Multiple indexes
Multiple Recoll indexes can be created by using several configuration
directories which are usually set to index different areas of the file
system. A specific index can be selected for updating or searching, using
the RECOLL_CONFDIR environment variable or the -c option to recoll and
recollindex.
A typical usage scenario for the multiple index feature would be for a
system administrator to set up a central index for shared data, that you
choose to search or not in addition to your personal data. Of course,
there are other possibilities. There are many cases where you know the
subset of files that should be searched, and where narrowing the search
can improve the results. You can achieve approximately the same effect
with the directory filter in advanced search, but multiple indexes will
have much better performance and may be worth the trouble.
A recollindex program instance can only update one specific index.
The main index (defined by RECOLL_CONFDIR or -c) is always active. If this
is undesirable, you can set up your base configuration to index an empty
directory.
The different search interfaces (GUI, command line, ...) have different
methods to define the set of indexes to be used, see the appropriate
section.
If a set of multiple indexes is to be used together for searches, some
configuration parameters must be consistent among the set. These are
parameters which need to be the same when indexing and searching. As the
parameters come from the main configuration when searching, they need to
be compatible with what was set when creating the other indexes (which
came from their respective configuration directories).
Most importantly, all indexes to be queried concurrently must have the
same option concerning character case and diacritics stripping, but there
are other constraints. Most of the relevant parameters are described in
the linked section.
2.3.2. Index case and diacritics sensitivity
As of Recoll version 1.18 you have a choice of building an index with
terms stripped of character case and diacritics, or one with raw terms.
For a source term of Résumé, the former will store resume, the latter
Résumé.
Each type of index allows performing searches insensitive to case and
diacritics: with a raw index, the user entry will be expanded to match all
case and diacritics variations present in the index. With a stripped
index, the search term will be stripped before searching.
A raw index allows for another possibility which a stripped index cannot
offer: using case and diacritics to discriminate between terms, returning
different results when searching for US and us or resume and résumé. Read
the section about search case and diacritics sensitivity for more details.
The type of index to be created is controlled by the indexStripChars
configuration variable which can only be changed by editing the
configuration file. Any change implies an index reset (not automated by
Recoll), and all indexes in a search must be set in the same way (again,
not checked by Recoll).
If indexStripChars is not set, Recoll 1.18 creates a stripped index by
default, for compatibility with previous versions.
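As a sketch, a raw index could be requested with a configuration like the
one written below into a scratch directory. It assumes that a value of 0
selects raw terms and that 1 (stripping) is the default, which is worth
checking against the comments in the sample recoll.conf.

```shell
conf=$(mktemp -d)    # stand-in for a new, still empty configuration directory

cat > "$conf/recoll.conf" <<'EOF'
# Assumption: 0 keeps raw terms (case and diacritics preserved),
# 1 strips them.
indexStripChars = 0
EOF

cat "$conf/recoll.conf"
```

Because Recoll does not reset the index by itself when this value changes,
the setting is best made before the first indexing run for a
configuration, or followed by a manual index reset.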
As a cost for added capability, a raw index will be slightly bigger than a
stripped one (around 10%). Also, searches will be more complex, so
probably slightly slower, and the feature is still young, so that a
certain amount of weirdness cannot be excluded.
One of the most adverse consequences of using a raw index is that some
phrase and proximity searches may become impossible: because each term
needs to be expanded, and all combinations searched for, the
multiplicative expansion may become unmanageable.
2.3.3. The index configuration GUI
Most parameters for a given index configuration can be set from a recoll
GUI running on this configuration (either as default, or by setting
RECOLL_CONFDIR or the -c option.)
The interface is started from the Preferences -> Index Configuration menu
entry. It is divided into four tabs, Global parameters, Local parameters,
Web history (which is explained in the next section) and Search
parameters.
The Global parameters tab allows setting global variables, like the lists
of top directories, skipped paths, or stemming languages.
The Local parameters tab allows setting variables that can be redefined
for subdirectories. This second tab has an initially empty list of
customisation directories, to which you can add. The variables are then
set for the currently selected directory (or at the top level if the empty
line is selected).
The Search parameters section defines parameters which are used at query
time, but are global to an index and affect all search tools, not only the
GUI.
The meaning for most entries in the interface is self-evident and
documented by a ToolTip popup on the text label. For more detail, you will
need to refer to the configuration section of this guide.
The configuration tool normally respects the comments and most of the
formatting inside the configuration file, so that it is quite possible to
use it on hand-edited files, which you might nevertheless want to back up
first.
2.4. Indexing WEB pages you visit
With the help of a Firefox extension, Recoll can index the Internet pages
that you visit. The extension was initially designed for the Beagle
indexer, but it has recently been renamed and better adapted to Recoll.
The extension works by copying visited WEB pages to an indexing queue
directory, which Recoll then processes, indexing the data, storing it into
a local cache, then removing the file from the queue.
This feature can be enabled in the GUI Index configuration panel, or by
editing the configuration file (set processwebqueue to 1).
A current pointer to the extension can be found, along with up-to-date
instructions, on the Recoll wiki.
A copy of the indexed WEB pages is retained by Recoll in a local cache
(from which previews can be fetched). The cache size can be adjusted from
the Index configuration / Web history panel. Once the maximum size is
reached, old pages are purged - both from the cache and the index - to
make room for new ones, so you need to explicitly archive in some other
place the pages that you want to keep indefinitely.
2.5. Extended attributes data
User extended attributes are named pieces of information that most modern
file systems can attach to any file.
Recoll versions 1.19 and later process extended attributes as document
fields by default. For older versions, this has to be activated at build
time.
A freedesktop standard defines a few special attributes, which are handled
as such by Recoll:
mime_type
If set, this overrides any other determination of the file MIME
type.
charset
If set, this defines the file character set (mostly useful for
plain text files).
By default, other attributes are handled as Recoll fields. On Linux, the
user prefix is removed from the name. This can be configured more
precisely inside the fields configuration file.
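As an illustration, the two special attributes can be set from the
command line on Linux. This sketch assumes that the setfattr command
(usually shipped in a separate attr package) is installed and that the
file system accepts user attributes. set_doc_type is a hypothetical
helper name. With the user prefix removed, the attributes appear to
Recoll as mime_type and charset.

```shell
# set_doc_type FILE MIMETYPE CHARSET
# Attach the two special attributes to FILE in the "user" namespace.
set_doc_type() {
    setfattr -n user.mime_type -v "$2" "$1" &&
    setfattr -n user.charset -v "$3" "$1"
}
```

For example: set_doc_type notes.txt text/plain iso-8859-1
The attributes set on a file can then be listed with getfattr -d notes.txt.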
2.6. Importing external tags
During indexing, it is possible to import metadata for each file by
executing commands. For example, this could extract user tag data for the
file and store it in a field for indexing.
See the section about the metadatacmds field in the main configuration
chapter for more detail.
2.7. Periodic indexing
2.7.1. Running indexing
Indexing is always performed by the recollindex program, which can be
started either from the command line or from the File menu in the recoll
GUI program. When started from the GUI, the indexing will run on the same
configuration recoll was started on. When started from the command line,
recollindex will use the RECOLL_CONFDIR variable or accept a -c confdir
option to specify a non-default configuration directory.
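The two ways of selecting a configuration from the command line can be
sketched as follows. The function names are hypothetical and only serve to
put the two forms side by side; the directory passed as argument is
whatever configuration directory you use.

```shell
# Both functions run one indexing pass on the configuration directory
# given as first argument; they differ only in how it is passed.
index_with_option() {
    recollindex -c "$1"
}

index_with_env() {
    RECOLL_CONFDIR="$1" recollindex
}
```

For example: index_with_option "$HOME/.recoll-home"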
If the recoll program finds no index when it starts, it will automatically
start indexing (unless you cancel it).
The recollindex indexing process can be interrupted by sending an
interrupt (Ctrl-C, SIGINT) or terminate (SIGTERM) signal. Some time may
elapse before the process exits, because it needs to properly flush and
close the index. This can also be done from the recoll GUI File -> Stop
Indexing menu entry.
After such an interruption, the index will be somewhat inconsistent
because some operations which are normally performed at the end of the
indexing pass will have been skipped (for example, the stemming and
spelling databases will be missing or out of date). You just need to
restart indexing at a later time to restore consistency. The indexing will
restart at the interruption point (the full file tree will be traversed,
but files that were indexed up to the interruption and for which the index
is still up to date will not need to be reindexed).
recollindex has a number of other options which are described in its man
page. Only a few will be described here.
Option -z will reset the index when starting. This is almost the same as
destroying the index files (the nuance is that the Xapian format version
will not be changed).
Option -Z will force the update of all documents without resetting the
index first. This will not have the "clean start" aspect of -z, but the
advantage is that the index will remain available for querying while it is
rebuilt, which can be a significant advantage if it is very big (some
installations need days for a full index rebuild).
Option -k will force retrying files which previously failed to be indexed,
for example because of a missing helper program.
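The three options above can be summarised in a small wrapper. This is
only a sketch: reindex and its mode names (reset, refresh, retry) are
hypothetical, and only the -z, -Z and -k options come from recollindex
itself.

```shell
# reindex MODE: run recollindex with the option matching MODE.
reindex() {
    case "$1" in
        reset)   recollindex -z ;;  # reset the index, then index everything
        refresh) recollindex -Z ;;  # update all documents, index stays usable
        retry)   recollindex -k ;;  # also retry files that failed before
        *)       echo "usage: reindex reset|refresh|retry" >&2
                 return 2 ;;
    esac
}
```

On a very big index, reindex refresh keeps queries working during the
rebuild, while reindex reset gives a clean start.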
The -i and -f options may also be of special interest. -i allows
indexing an explicit list of files (given as command line parameters or
read on stdin). -f tells recollindex to ignore file selection parameters
from the configuration. Together, these options allow building a custom
file selection process for some area of the file system, by adding the top
directory to the skippedPaths list and using an appropriate file selection
method to build the file list to be fed to recollindex -if. Trivial
example:
find . -name indexable.txt -print | recollindex -if
recollindex -i will not descend into subdirectories specified as
parameters, but just add them as index entries. It is up to the external
file selection method to build the complete file list.
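Because recollindex -i does not descend into directories, the external
selection has to list every file itself. A minimal sketch, using find to
expand the directories (index_tree is a hypothetical helper name, and
file names containing newlines are not handled):

```shell
# index_tree DIR...: expand each directory into the regular files below
# it and feed the list to recollindex, one path per line on stdin.
index_tree() {
    find "$@" -type f -print | recollindex -if
}
```

For example: index_tree /some/area/of/the/fs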
2.7.2. Using cron to automate indexing
The most common way to set up indexing is to have a cron task execute it
every night. For example the following crontab entry would do it every day
at 3:30AM (supposing recollindex is in your PATH):
30 3 * * * recollindex > /some/tmp/dir/recolltrace 2>&1
Or, using anacron:
1 15 recollindex su mylogin -c "recollindex > /tmp/rcltraceme 2>&1"
As of version 1.17 the Recoll GUI has dialogs to manage crontab entries
for recollindex. You can reach them from the Preferences -> Indexing
Schedule menu. They only work with the good old cron, and do not give
access to all features of cron scheduling.
The usual command to edit your crontab is crontab -e (which will usually
start the vi editor to edit the file). You may have more sophisticated
tools available on your system.
Please be aware that there may be differences between your usual
interactive command line environment and the one seen by crontab commands.
Especially the PATH variable may be of concern. Please check the crontab
manual pages about possible issues.
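One way to avoid PATH surprises is to set the variable inside the crontab
itself. This is a sketch which assumes a cron implementation accepting
variable assignments in crontab files (check the crontab manual page on
your system). The directories listed are examples and must include the
one where recollindex is installed.

```
PATH=/usr/local/bin:/usr/bin:/bin
30 3 * * * recollindex > /some/tmp/dir/recolltrace 2>&1
```

Alternatively, the crontab entry can call recollindex by its full path.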
2.8. Real time indexing
Real time monitoring/indexing is performed by starting the recollindex -m
command. With this option, recollindex will detach from the terminal and
become a daemon, permanently monitoring file changes and updating the
index.
Under KDE, Gnome and some other desktop environments, the daemon can be
automatically started when you log in, by creating a desktop file inside
the ~/.config/autostart directory. This can be done for you by the Recoll
GUI. Use the Preferences->Indexing Schedule menu.
With older X11 setups, starting the daemon is normally performed as part
of the user session script.
The rclmon.sh script can be used to easily start and stop the daemon. It
can be found in the examples directory (typically
/usr/local/[share/]recoll/examples).
For example, my out of fashion xdm-based session has a .xsession script
with the following lines at the end:
recollconf=$HOME/.recoll-home
recolldata=/usr/local/share/recoll
RECOLL_CONFDIR=$recollconf $recolldata/examples/rclmon.sh start
fvwm
The indexing daemon gets started, then the window manager, for which the
session waits.
By default the indexing daemon will monitor the state of the X11 session,
and exit when it finishes, so it is not necessary to kill it explicitly.
(The X11 server monitoring can be disabled with option -x to recollindex).
If you use the daemon completely out of an X11 session, you need to add
option -x to disable X11 session monitoring (else the daemon will not
start).
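The choice between the two invocations can be sketched in a start script.
The test on the DISPLAY variable is an assumption about how to recognise
the absence of an X11 session, and start_monitor is a hypothetical helper
name.

```shell
# start_monitor: start the real time indexing daemon, disabling X11
# session monitoring when no display is available.
start_monitor() {
    if [ -n "${DISPLAY:-}" ]; then
        recollindex -m
    else
        recollindex -m -x
    fi
}
```

Such a function could be called from a system or login script which runs
outside of any graphical session.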
By default, the messages from the indexing daemon will be sent to the same
file as those from the interactive commands (logfilename). You may want to
change this by setting the daemlogfilename and daemloglevel configuration
parameters. Also, the log file will only be truncated when the daemon
starts. If the daemon runs permanently, the log file may grow quite big,
depending on the log level.
When building Recoll, the real time indexing support can be customised
during package configuration with the --with[out]-fam or
--with[out]-inotify options. The default is currently to include inotify
monitoring on systems that support it, and, as of Recoll 1.17, gamin
support on FreeBSD.
While it is convenient that data is indexed in real time, repeated
indexing can generate a significant load on the system when files such as
email folders change. Also, monitoring large file trees by itself
significantly taxes system resources. You probably do not want to enable
it if your system is short on resources. Periodic indexing is adequate in
most cases.
Increasing resources for inotify
On Linux systems, monitoring a big tree may require increasing the
resources available to inotify, which are normally defined in
/etc/sysctl.conf.
### inotify
#
# cat /proc/sys/fs/inotify/max_queued_events - 16384
# cat /proc/sys/fs/inotify/max_user_instances - 128
# cat /proc/sys/fs/inotify/max_user_watches - 16384
#
# -- Change to:
#
fs.inotify.max_queued_events=32768
fs.inotify.max_user_instances=256
fs.inotify.max_user_watches=32768
Especially, you will need to trim your tree or adjust the max_user_watches
value if indexing exits with a message about errno ENOSPC (28) from
inotify_add_watch.
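To get a rough idea of the value needed, you can count the directories
under the monitored trees. This sketch assumes that the monitor uses about
one inotify watch per directory, which is an assumption and not a figure
from the Recoll documentation. count_dirs and watches_sufficient are
hypothetical helper names.

```shell
# count_dirs DIR...: number of directories at or below the given trees.
count_dirs() {
    find "$@" -type d -print | wc -l | tr -d ' '
}

# watches_sufficient DIR...: succeed if the directory count is below the
# current per-user limit (Linux only, the limit is read from /proc).
watches_sufficient() {
    limit=$(cat /proc/sys/fs/inotify/max_user_watches) || return 2
    [ "$(count_dirs "$@")" -lt "$limit" ]
}
```

For example: watches_sufficient "$HOME" || echo "raise max_user_watches"
After editing /etc/sysctl.conf, the new values can be loaded by running
sysctl -p as root.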
2.8.1. Slowing down the reindexing rate for fast changing files
When using the real time monitor, it may happen that some files need to be
indexed, but change so often that they impose an excessive load for the
system.
Recoll provides a configuration option to specify the minimum time before
which a file, specified by a wildcard pattern, cannot be reindexed. See
the mondelaypatterns parameter in the configuration section.
Chapter 3. Searching
3.1. Searching with the Qt graphical user interface
The recoll program provides the main user interface for searching. It is
based on the Qt library.
recoll has two search modes:
o Simple search (the default, on the main screen) has a single entry
field where you can enter multiple words.
o Advanced search (a panel accessed through the Tools menu or the
toolbox bar icon) has multiple entry fields, which you may use to
build a logical condition, with additional filtering on file type,
location in the file system, modification date, and size.
In most cases, you can enter the terms as you think of them, even if they
contain embedded punctuation or other non-textual characters. For example,
Recoll can handle things like email addresses, or arbitrary cut and paste
from another text window, punctuation and all.
The main case where you should enter text differently from how it is
printed is for East Asian languages (Chinese, Japanese, Korean). Words
composed of single or multiple characters should be entered separated by
white space in this case (they would typically be printed without white
space).
Some searches can be quite complex, and you may want to re-use them later,
perhaps with some tweaking. Recoll versions 1.21 and later can save and
restore searches, using XML files. See Saving and restoring queries.
3.1.1. Simple search
1. Start the recoll program.
2. Possibly choose a search mode: Any term, All terms, File name or Query
language.
3. Enter search term(s) in the text field at the top of the window.