More documentation can be found in the doc/ directory or at http://www.recoll.org
Chapter 5. Installation and configuration
5.1. Installing a binary copy
Recoll binary copies are always distributed as regular packages for your
system. They can be obtained through the system's normal software
distribution framework (e.g. Debian/Ubuntu apt, FreeBSD ports), from a
"backports" repository providing versions newer than the standard ones,
or, in some cases, from the Recoll web site.
Another form of binary install, as pre-compiled source trees, used to
exist, but it was less convenient than the packages and is no longer
provided.
The package management tools will usually automatically deal with hard
dependencies for packages obtained from a proper package repository. You
will have to deal with them by hand for downloaded packages (for example,
when dpkg complains about missing dependencies).
In all cases, you will have to check or install supporting applications
for the file types that you want to index beyond those that are natively
processed by Recoll (text, HTML, email files, and a few others).
You may also want to look at the configuration section, although this is
not necessary for a quick test with default parameters. Most parameters
can be set more conveniently from the GUI.
5.2. Supporting packages
Recoll uses external applications to index some file types. You need to
install them for the file types that you wish to have indexed. These are
optional run-time dependencies: none is needed for building or running
Recoll, except for indexing its specific file type.
After an indexing pass, the commands that were found missing can be
displayed from the recoll File menu. The list is stored in the missing
text file inside the configuration directory.
A list of common file types which need external commands follows. Many of
the handlers need the iconv command, which is not always listed as a
dependency.
Please note that, due to the relatively dynamic nature of this
information, the most up to date version is now kept on
http://www.recoll.org/features.html along with links to the home pages or
best source/patches pages, and misc tips. The list below is not updated
often and may be quite stale.
For many Linux distributions, most of the commands listed can be installed
from the package repositories. However, the packages are sometimes
outdated, or not the best version for Recoll, so you should take a look at
http://www.recoll.org/features.html if a file type is important to you.
As of Recoll release 1.14, a number of XML-based formats that were handled
by ad hoc handler code now use the xsltproc command, which usually comes
with libxslt. These are: abiword, fb2 (ebooks), kword, openoffice, svg.
Now for the list:
o Openoffice files need unzip and xsltproc.
o PDF files need pdftotext which is part of Poppler (usually comes with
the poppler-utils package). Avoid the original one from Xpdf.
o Postscript files need pstotext. The original version has an issue with
shell characters in file names, which is corrected in recent packages.
See http://www.recoll.org/features.html for more detail.
o MS Word needs antiword. It is also useful to have wvWare installed as
it may be used as a fallback for some files which antiword does not
handle.
o MS Excel and PowerPoint are processed by internal Python handlers.
o MS Open XML (docx) needs xsltproc.
o Wordperfect files need wpd2html from the libwpd (or libwpd-tools on
Ubuntu) package.
o RTF files need unrtf, which, in its older versions, has much trouble
with non-western character sets. Many Linux distributions carry
outdated unrtf versions. Check http://www.recoll.org/features.html for
details.
o TeX files need untex or detex. Check
http://www.recoll.org/features.html for sources if it's not packaged
for your distribution.
o dvi files need dvips.
o djvu files need djvutxt and djvused from the DjVuLibre package.
o Audio files: Recoll releases 1.14 and later use a single Python
handler based on mutagen for all audio file types.
o Pictures: Recoll uses the Exiftool Perl package to extract tag
information. Most image file formats are supported. Note that there
may not be much interest in indexing the technical tags (image size,
aperture, etc.). This is only of interest if you store personal tags
or textual descriptions inside the image files.
o chm: files in Microsoft help format need Python and the pychm module
(which needs chmlib).
o ICS: up to Recoll 1.13, iCalendar files need Python and the icalendar
module. icalendar is not needed for newer versions, which use internal
code.
o Zip archives need Python (and the standard zipfile module).
o Rar archives need Python, the rarfile Python module and the unrar
utility.
o Midi karaoke files need Python and the Midi module.
o Konqueror webarchive files need Python (uses the tarfile module).
o Mimehtml web archive format (support is based on the email handler,
which introduces some mild weirdness, but is still usable).
Text, HTML, email folders, and Scribus files are processed internally. Lyx
is used to index Lyx files. Many handlers need iconv and the standard sed
and awk.
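As a rough way to see which of the helper commands above are present before indexing, the list can be checked against the PATH. The following is only a sketch: the HELPERS table and the find_missing function are illustrative names, not part of Recoll, the table covers only some of the command-line tools named above (no Python modules), and the authoritative report remains the missing-helpers list that recoll shows after an indexing pass.

```python
import shutil

# Hypothetical table built from the list above: file type -> helper commands.
HELPERS = {
    "OpenOffice": ["unzip", "xsltproc"],
    "PDF": ["pdftotext"],
    "PostScript": ["pstotext"],
    "MS Word": ["antiword"],
    "WordPerfect": ["wpd2html"],
    "RTF": ["unrtf"],
    "DVI": ["dvips"],
    "DjVu": ["djvutxt", "djvused"],
    "Rar": ["unrar"],
    "many handlers": ["iconv", "sed", "awk"],
}


def find_missing(helpers=HELPERS, which=shutil.which):
    """Return {file type: [commands not found on PATH]} for incomplete entries."""
    missing = {}
    for file_type, commands in helpers.items():
        absent = [cmd for cmd in commands if which(cmd) is None]
        if absent:
            missing[file_type] = absent
    return missing


if __name__ == "__main__":
    for file_type, commands in sorted(find_missing().items()):
        print(f"{file_type}: missing {', '.join(commands)}")
```

The which argument is only there so that the lookup can be replaced in tests; normal use calls find_missing() with no arguments.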
5.3. Building from source
5.3.1. Prerequisites
If you can install any or all of the following through the package manager
for your system, all the better. Qt in particular is a very large piece of
software, but you will most probably be able to find a binary package.
You may have to compile Xapian, but this is easy.
The shopping list:
o C++ compiler. Up to Recoll version 1.13.04, its absence can manifest
itself by strange messages about a missing iconv_open.
o Development files for Xapian core.
Important
If you are building Xapian for an older CPU (before Pentium 4 or
Athlon 64), you need to add the --disable-sse flag to the configure
command. Otherwise, all Xapian applications will crash with an illegal
instruction error.
o Development files for Qt 4. Recoll has not been tested with Qt 5 yet.
Recoll 1.15.9 was the last version to support Qt 3. If you do not want
to install or build the Qt Webkit module, Recoll has a configuration
option to disable its use (see further).
o Development files for X11 and zlib.
o You may also need libiconv. On Linux systems, the iconv interface is
part of libc and you should not need to do anything special.
Check the Recoll download page for up to date version information.
5.3.2. Building
Recoll has been built on Linux, FreeBSD, Mac OS X, and Solaris. Most
versions after 2005 should be OK, and maybe some older ones too (Solaris 8
is OK). If you build on another system and need to modify things, I would
very much welcome patches.
Configure options:
o --without-aspell will disable the code for phonetic matching of search
terms.
o --with-fam or --with-inotify will enable the code for real time
indexing. Inotify support is enabled by default on recent Linux
systems.
o --with-qzeitgeist will enable sending Zeitgeist events about the
visited search results, and needs the qzeitgeist package.
o --disable-webkit is available from version 1.17 to implement the
result list with a Qt QTextBrowser instead of a WebKit widget if you
do not or can't depend on the latter.
o --disable-idxthreads is available from version 1.19 to suppress
multithreading inside the indexing process. You can also use the
run-time configuration to restrict recollindex to using a single
thread, but the compile-time option may disable a few more unused
locks. This only applies to the use of multithreading for the core
index processing (data input). The Recoll monitor mode always uses at
least two threads of execution.
o --disable-python-module will avoid building the Python module.
o --disable-xattr will prevent fetching data from file extended
attributes. Beyond a few standard attributes, fetching extended
attribute data is only useful if some application stores data
there, and it also needs some simple configuration (see comments in the
fields configuration file).
o --enable-camelcase will enable splitting camelCase words. This is not
enabled by default as it has the unfortunate side-effect of making
some phrase searches quite confusing: for example, "MySQL manual" would
be matched by "MySQL manual" and "my sql manual" but not "mysql manual"
(only inside phrase searches).
o --with-file-command Specify the version of the 'file' command to use
(e.g. --with-file-command=/usr/local/bin/file). Can be useful to enable
the GNU version on systems where the native one is bad.
o --disable-qtgui Disable the Qt interface. Will allow building the
indexer and the command line search program in absence of a Qt
environment.
o --disable-x11mon Disable X11 connection monitoring inside recollindex.
Together with --disable-qtgui, this allows building recoll without Qt
and X11.
o --disable-pic will compile Recoll with position-dependent code. This
is incompatible with building the KIO or the Python or PHP extensions,
but might yield very marginally faster code.
o The usual autoconf configure options, like --prefix, also apply.
Normal procedure:
cd recoll-xxx
./configure
make
(practices usual hardship-repelling invocations)
There is little auto-configuration. The configure script will mainly link
one of the system-specific files in the mk directory to mk/sysconf. If
your system is not known yet, it will tell you as much, and you may want
to manually copy and modify one of the existing files (the new file name
should be the output of uname -s).
5.3.2.1. Building on Solaris
We did not test building the GUI on Solaris for recent versions. You will
need at least Qt 4.4. There are some hints on an old web site page, which
may still be valid.
Someone did test the 1.19 indexer and Python module build: they do work,
with a few minor glitches. Be sure to use GNU make and install.
5.3.3. Installation
Either type make install or execute recollinstall prefix, in the root of
the source tree. This will copy the commands to prefix/bin and the sample
configuration files, scripts and other shared data to prefix/share/recoll.
If the installation prefix given to recollinstall is different from either
the system default or the value which was specified when executing
configure (as in configure --prefix /some/path), you will have to set the
RECOLL_DATADIR environment variable to indicate where the shared data is
to be found (e.g. for (ba)sh: export
RECOLL_DATADIR=/some/path/share/recoll).
You can then proceed to configuration.
5.4. Configuration overview
Most of the parameters specific to the recoll GUI are set through the
Preferences menu and stored in the standard Qt place
($HOME/.config/Recoll.org/recoll.conf). You probably do not want to edit
this by hand.
Recoll indexing options are set inside text configuration files located in
a configuration directory. There can be several such directories, each of
which defines the parameters for one index.
The configuration files can be edited by hand or through the Index
configuration dialog (Preferences menu). The GUI tool will try to respect
your formatting and comments as much as possible, so it is quite possible
to use both ways.
The most accurate documentation for the configuration parameters is given
by comments inside the default files, and we will just give a general
overview here.
By default, for each index, there are two sets of configuration files.
System-wide configuration files are kept in a directory named like
/usr/[local/]share/recoll/examples, and define default values, shared by
all indexes. For each index, a parallel set of files defines the
customized parameters.
In addition (as of Recoll version 1.19.7), it is possible to specify two
additional configuration directories which will be stacked before and
after the user configuration directory. These are defined by the
RECOLL_CONFTOP and RECOLL_CONFMID environment variables. Values from
configuration files inside the top directory will override user ones,
values from configuration files inside the middle directory will override
system ones and be overridden by user ones. These two variables may be of
use to applications which augment Recoll functionality, and need to add
configuration data without disturbing the user's files. Please note that
the two, currently single, values will probably be interpreted as
colon-separated lists in the future: do not use colon characters inside
the directory paths.
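The resulting priority order (top, then user, then middle, then system) can be pictured as a stacked lookup. The sketch below only illustrates that order: effective_config is a hypothetical helper, the parameter values are made up, and Recoll's own implementation is not shown here.

```python
from collections import ChainMap


def effective_config(system, user, top=None, mid=None):
    """Stack configuration layers from highest to lowest priority.

    top  : values from the RECOLL_CONFTOP directory (override user values)
    user : values from the user configuration directory
    mid  : values from the RECOLL_CONFMID directory (override system values)
    system : system-wide defaults
    """
    return ChainMap(top or {}, user, mid or {}, system)


# Made-up values, for illustration only.
system = {"followLinks": "0", "defaultcharset": "iso8859-1"}
mid = {"defaultcharset": "utf-8", "nocjk": "1"}
user = {"nocjk": "0"}
top = {"followLinks": "1"}

conf = effective_config(system, user, top=top, mid=mid)
print(conf["followLinks"], conf["defaultcharset"], conf["nocjk"])
```

A lookup stops at the first layer that defines the parameter, so a value from the middle directory is visible only when the user and top layers leave it unset.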
The default location of the configuration is the .recoll directory in your
home. Most people will only use this directory.
This location can be changed, or others can be added with the
RECOLL_CONFDIR environment variable or the -c option parameter to recoll
and recollindex.
If the .recoll directory does not exist when recoll or recollindex are
started, it will be created with a set of empty configuration files.
recoll will give you a chance to edit the configuration file before
starting indexing. recollindex will proceed immediately. To avoid
mistakes, the automatic directory creation will only occur for the default
location, not if -c or RECOLL_CONFDIR were used (in the latter cases, you
will have to create the directory).
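The location rules above can be summarized in a few lines. This is a sketch under stated assumptions: resolve_confdir and ensure_confdir are hypothetical names, and the text does not say which of -c and RECOLL_CONFDIR wins when both are given, so the order used here (-c first) is a guess.

```python
import os


def resolve_confdir(cli_confdir=None, environ=os.environ):
    """Return (directory, is_default).

    Assumption: a -c value takes precedence over RECOLL_CONFDIR.
    """
    if cli_confdir:
        return os.path.expanduser(cli_confdir), False
    env_dir = environ.get("RECOLL_CONFDIR")
    if env_dir:
        return os.path.expanduser(env_dir), False
    return os.path.join(os.path.expanduser("~"), ".recoll"), True


def ensure_confdir(path, is_default):
    """Create only the default directory; other locations must already exist."""
    if os.path.isdir(path):
        return
    if not is_default:
        raise FileNotFoundError(
            f"configuration directory {path!r} must be created first")
    os.makedirs(path)
```

The second function mirrors the safety rule described above: automatic creation happens only for the default location.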
All configuration files share the same format. For example, a short
extract of the main configuration file might look as follows:
# Space-separated list of directories to index.
topdirs = ~/docs /usr/share/doc
[~/somedirectory-with-utf8-txt-files]
defaultcharset = utf-8
There are three kinds of lines:
o Comment (starts with #) or empty.
o Parameter assignment (name = value).
o Section definition ([somedirname]).
Depending on the type of configuration file, section definitions either
separate groups of parameters or allow redefining some parameters for a
directory sub-tree. They stay in effect until another section definition,
or the end of file, is encountered. Some of the parameters used for
indexing are looked up hierarchically from the current directory location
upwards. Not all parameters can be meaningfully redefined; whether one can
is specified for each parameter in the next section.
When found at the beginning of a file path, the tilde character (~) is
expanded to the name of the user's home directory, as a shell would do.
White space is used for separation inside lists. List elements with
embedded spaces can be quoted using double-quotes.
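To make the format concrete, here is a minimal reader for it. It is a sketch, not Recoll's parser: parse_config and split_list are hypothetical names, backslash line continuation is inferred from the skippedNames default shown further down, and shlex is used as an approximation of the double-quote rule (it also gives meaning to single quotes and backslashes, which the text above does not mention).

```python
import os
import shlex

SAMPLE = """\
# Space-separated list of directories to index.
topdirs = ~/docs /usr/share/doc
[~/somedirectory-with-utf8-txt-files]
defaultcharset = utf-8
"""


def parse_config(text):
    """Parse configuration text into {section: {name: value}}.

    The empty-string section holds parameters set before any [section] line.
    """
    sections = {"": {}}
    current = sections[""]
    pending = ""
    for raw in text.splitlines():
        line = pending + raw.strip()
        pending = ""
        if line.endswith("\\"):           # value continues on the next line
            pending = line[:-1]
            continue
        if not line or line.startswith("#"):
            continue                       # comment or empty line
        if line.startswith("[") and line.endswith("]"):
            current = sections.setdefault(line[1:-1], {})
            continue
        name, sep, value = line.partition("=")
        if sep:
            current[name.strip()] = value.strip()
    return sections


def split_list(value):
    """Split a list value on white space, keeping double-quoted elements whole."""
    return shlex.split(value)


if __name__ == "__main__":
    conf = parse_config(SAMPLE)
    for element in split_list(conf[""]["topdirs"]):
        print(os.path.expanduser(element))   # leading ~ becomes the home directory
```

Sections are kept as plain strings here; applying a section to a directory sub-tree, and the upward lookup described above, are left out.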
Encoding issues. Most of the configuration parameters are plain ASCII. Two
particular sets of values may cause encoding issues:
o File path parameters may contain non-ascii characters and should use
the exact same byte values as found in the file system directory.
Usually, this means that the configuration file should use the system
default locale encoding.
o The unac_except_trans parameter should be encoded in UTF-8. If your
system locale is not UTF-8, and you need to also specify non-ascii
file paths, this poses a difficulty because common text editors cannot
handle multiple encodings in a single file. In this relatively
unlikely case, you can edit the configuration file as two separate
text files with appropriate encodings, and concatenate them to create
the complete configuration.
5.4.1. Environment variables
RECOLL_CONFDIR
Defines the main configuration directory.
RECOLL_TMPDIR, TMPDIR
Locations for temporary files, in this order of priority. The
default if none of these is set is to use /tmp. Big temporary
files may be created during indexing, mostly for decompressing,
and also for processing, e.g. email attachments.
RECOLL_CONFTOP, RECOLL_CONFMID
Allow adding configuration directories with priorities below and
above the user directory (see the Configuration overview section
above for details).
RECOLL_EXTRA_DBS, RECOLL_ACTIVE_EXTRA_DBS
Help for setting up external indexes. See this paragraph for
explanations.
RECOLL_DATADIR
Defines a replacement for the default location of Recoll data files
(normally found in, e.g., /usr/share/recoll).
RECOLL_FILTERSDIR
Defines a replacement for the default location of Recoll filters
(normally found in, e.g., /usr/share/recoll/filters).
ASPELL_PROG
aspell program to use for creating the spelling dictionary. The
result has to be compatible with the libaspell which Recoll is
using.
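As an illustration of the priority order given for RECOLL_TMPDIR and TMPDIR above, the lookup can be written as follows. The function name temp_dir is hypothetical, and treating an empty value as unset is an assumption.

```python
import os


def temp_dir(environ=os.environ):
    """Pick the temporary directory: RECOLL_TMPDIR, then TMPDIR, then /tmp."""
    for name in ("RECOLL_TMPDIR", "TMPDIR"):
        value = environ.get(name)
        if value:                 # assumption: an empty value counts as unset
            return value
    return "/tmp"
```

Because big temporary files may be created during indexing, pointing RECOLL_TMPDIR at a roomy file system affects Recoll only, while TMPDIR is shared with other programs.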
5.4.2. The main configuration file, recoll.conf
recoll.conf is the main configuration file. It defines things like what to
index (top directories and things to ignore), and the default character
set to use for document types which do not specify it internally.
The default configuration will index your home directory. If this is not
appropriate, start recoll to create a blank configuration, click Cancel,
and edit the configuration file before restarting the command. This will
start the initial indexing, which may take some time.
Most of the following parameters can be changed from the Index
Configuration menu in the recoll interface. Some can only be set by
editing the configuration file.
5.4.2.1. Parameters affecting what documents we index:
topdirs
Specifies the list of directories or files to index (recursively
for directories). You can use symbolic links as elements of this
list. See the followLinks option about following symbolic links
found under the top elements (not followed by default).
skippedNames
A space-separated list of wildcard patterns for names of files or
directories that should be completely ignored. The list defined in
the default file is:
skippedNames = #* bin CVS Cache cache* caughtspam tmp .thumbnails .svn \
*~ .beagle .git .hg .bzr loop.ps .xsession-errors \
.recoll* xapiandb recollrc recoll.conf
The list can be redefined at any sub-directory in the indexed
area.
The top-level directories are not affected by this list (that is,
a directory in topdirs might match and would still be indexed).
The list in the default configuration does not exclude hidden
directories (names beginning with a dot), which means that it may
index quite a few things that you do not want. On the other hand,
email user agents like thunderbird usually store messages in
hidden directories, and you probably want this indexed. One
possible solution is to have .* in skippedNames, and add things
like ~/.thunderbird or ~/.evolution in topdirs.
Not even the file names are indexed for patterns in this list. See
the noContentSuffixes variable for an alternative approach which
indexes the file names.
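The effect of skippedNames can be tried out with Python's fnmatch module, which approximates shell-style wildcard matching on a single name. This is only a sketch: is_skipped is a hypothetical helper, and Recoll's own matching may differ in details such as case handling.

```python
from fnmatch import fnmatchcase

# The default list quoted above, as individual patterns.
DEFAULT_SKIPPED_NAMES = (
    "#* bin CVS Cache cache* caughtspam tmp .thumbnails .svn "
    "*~ .beagle .git .hg .bzr loop.ps .xsession-errors "
    ".recoll* xapiandb recollrc recoll.conf"
).split()


def is_skipped(name, patterns=DEFAULT_SKIPPED_NAMES):
    """True if a single file or directory name matches one of the patterns."""
    return any(fnmatchcase(name, pattern) for pattern in patterns)


if __name__ == "__main__":
    for name in ("notes.txt", "notes.txt~", ".git", ".thunderbird"):
        print(name, is_skipped(name))
    # Adding ".*" hides every dot-directory, as suggested above.
    print(is_skipped(".thunderbird", DEFAULT_SKIPPED_NAMES + [".*"]))
```

The patterns apply to the last path element only, which is why a directory listed in topdirs is still indexed even when its name matches.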
noContentSuffixes
This is a list of file name endings (not wildcard expressions, nor
dot-delimited suffixes). Only the names of matching files will be
indexed (no attempt at MIME type identification, no decompression,
no content indexing). This can be redefined for subdirectories,
and edited from the GUI. The default value is:
noContentSuffixes = .md5 .map \
.o .lib .dll .a .sys .exe .com \
.mpp .mpt .vsd \
.img .img.gz .img.bz2 .img.xz .image .image.gz .image.bz2 .image.xz \
.dat .bak .rdf .log.gz .log .db .msf .pid \
,v ~ #
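Since these are plain name endings rather than wildcard patterns, the test amounts to a string comparison on the end of the file name. A minimal sketch follows; names_only is a hypothetical helper, and a case-sensitive comparison is assumed.

```python
# The default value quoted above, as individual endings.
DEFAULT_NO_CONTENT_SUFFIXES = (
    ".md5 .map .o .lib .dll .a .sys .exe .com .mpp .mpt .vsd "
    ".img .img.gz .img.bz2 .img.xz .image .image.gz .image.bz2 .image.xz "
    ".dat .bak .rdf .log.gz .log .db .msf .pid ,v ~ #"
).split()


def names_only(filename, suffixes=DEFAULT_NO_CONTENT_SUFFIXES):
    """True if only the file name should be indexed (plain ending comparison)."""
    return filename.endswith(tuple(suffixes))
```

Endings such as ",v" and "~" show why these are not dot-delimited suffixes: "main.c,v" and "draft.txt~" both match, although neither ends in a conventional extension.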
skippedPaths and daemSkippedPaths
A space-separated list of patterns for paths of files or
directories that should be skipped. There is no default in the
sample configuration file, but the code always adds the
configuration and database directories in there.
skippedPaths is used both by batch and real time indexing.
daemSkippedPaths can be used to specify things that should be
indexed at startup, but not monitored.
Example of use for skipping text files only in a specific
directory:
skippedPaths = ~/somedir/*.txt
skippedPathsFnmPathname
The values in the *skippedPaths variables are matched by default
with fnmatch(3), with the FNM_PATHNAME flag. This means that '/'
characters must be matched explicitly. You can set
skippedPathsFnmPathname to 0 to disable the use of FNM_PATHNAME
(meaning that /*/dir3 will match /dir1/dir2/dir3).
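The difference made by FNM_PATHNAME can be reproduced approximately in Python. The standard fnmatch module has no such flag, so the sketch below emulates it by comparing the path and the pattern one component at a time; path_matches is a hypothetical helper, and corner cases of fnmatch(3) (for example bracket expressions containing '/') are not covered.

```python
from fnmatch import fnmatchcase


def path_matches(path, pattern, fnm_pathname=True):
    """Approximate fnmatch(3) with or without the FNM_PATHNAME flag."""
    if not fnm_pathname:
        # Wildcards are free to span '/' characters.
        return fnmatchcase(path, pattern)
    # With FNM_PATHNAME a wildcard never matches '/', so the path and the
    # pattern must have the same number of components, matched pairwise.
    path_parts = path.split("/")
    pattern_parts = pattern.split("/")
    if len(path_parts) != len(pattern_parts):
        return False
    return all(fnmatchcase(part, pat)
               for part, pat in zip(path_parts, pattern_parts))


if __name__ == "__main__":
    print(path_matches("/dir1/dir2/dir3", "/*/dir3"))                      # default behaviour
    print(path_matches("/dir1/dir2/dir3", "/*/dir3", fnm_pathname=False))  # flag disabled
```

With the default setting, a pattern like /*/dir3 therefore only reaches directories exactly two levels deep.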
zipSkippedNames
A space-separated list of patterns for names of files or
directories that should be ignored inside zip archives. This is
used directly by the zip handler, and has a function similar to
skippedNames, but works independently. Can be redefined for
filesystem subdirectories. For versions up to 1.19, you will need
to update the Zip handler and install a supplementary Python
module. The details are described on the Recoll wiki.
followLinks
Specifies if the indexer should follow symbolic links while
walking the file tree. The default is to ignore symbolic links to
avoid multiple indexing of linked files. No effort is made to
avoid duplication when this option is set to true. This option can
be set individually for each of the topdirs members by using
sections. It cannot be changed below the topdirs level.
indexedmimetypes
Recoll normally indexes any file which it knows how to read. This
list lets you restrict the indexed MIME types to what you specify.
If the variable is unspecified or the list empty (the default),
all supported types are processed. Can be redefined for
subdirectories.
excludedmimetypes
This list lets you exclude some MIME types from indexing. Can be
redefined for subdirectories.
compressedfilemaxkbs
Size limit for compressed (.gz or .bz2) files. These need to be
decompressed in a temporary directory for identification, which
can be very wasteful if 'uninteresting' big compressed files are
present. Negative means no limit, 0 means no processing of any
compressed file. Defaults to -1.
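The three cases of compressedfilemaxkbs (negative, zero, positive) can be spelled out as below. This is an illustrative sketch: should_decompress is a hypothetical name, and whether a file exactly at the limit is still processed is an assumption.

```python
def should_decompress(size_kb, compressedfilemaxkbs=-1):
    """Decide whether a compressed file of size_kb kilobytes gets processed."""
    if compressedfilemaxkbs < 0:
        return True             # negative: no limit
    if compressedfilemaxkbs == 0:
        return False            # zero: never process compressed files
    return size_kb <= compressedfilemaxkbs   # assumption: the limit is inclusive
```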
textfilemaxmbs
Maximum size for text files. Very big text files are often
uninteresting logs. Set to -1 to disable (default 20MB).
textfilepagekbs
If set to other than -1, text files will be indexed as multiple
documents of the given page size. This may be useful if you do
want to index very big text files as it will both reduce memory
usage at index time and help with loading data to the preview
window. A size of a few megabytes would seem reasonable (default:
1MB).
membermaxkbs
This defines the maximum size in kilobytes for an archive member
(zip, tar or rar at the moment). Bigger entries will be skipped.
indexallfilenames
Recoll indexes file names in a special section of the database to
allow specific file names searches using wild cards. This
parameter decides if file name indexing is performed only for
files with MIME types that would qualify them for full text
indexing, or for all files inside the selected subtrees,
independently of MIME type.
usesystemfilecommand
Decide if we execute a system command (file -i by default) as a
final step for determining the MIME type for a file (the main
procedure uses suffix associations as defined in the mimemap
file). This can be useful for files with suffix-less names, but it
will also cause the indexing of many bogus "text" files.
systemfilecommand
Command to use for MIME type determination if
usesystemfilecommand is set. Recent versions of xdg-mime sometimes
work better than file.
processwebqueue
If this is set, process the directory where Web browser plugins
copy visited pages for indexing.
webqueuedir
The path to the web indexing queue. This is hard-coded in the
Firefox plugin as ~/.recollweb/ToIndex so there should be no need
to change it.
5.4.2.2. Parameters affecting how we generate terms:
Changing some of these parameters will imply a full reindex. Also, when
using multiple indexes, it may not make sense to search indexes that don't
share the values for these parameters, because they usually affect both
search and index operations.
indexStripChars
Decide if we strip characters of diacritics and convert them to
lower-case before terms are indexed. If we don't, searches
sensitive to case and diacritics can be performed, but the index
will be bigger, and some marginal weirdness may sometimes occur.
The default is a stripped index (indexStripChars = 1) for now.
When using multiple indexes for a search, this parameter must be
defined identically for all. Changing the value implies an index
reset.
maxTermExpand
Maximum expansion count for a single term (e.g.: when using
wildcards). The default of 10000 is reasonable and will avoid
queries that appear frozen while the engine is walking the term
list.
maxXapianClauses
Maximum number of elementary clauses we can add to a single Xapian
query. In some cases, the result of term expansion can be
multiplicative, and we want to avoid using excessive memory. The
default of 100 000 should be both high enough in most cases and
compatible with current typical hardware configurations.
nonumbers
If this is set to true, no terms will be generated for numbers. For
example, "123", "1.5e6", and "192.168.1.4" would not be indexed
("value123" would still be). Numbers are often quite interesting
to search for, and this should probably not be set except in
special situations, e.g. scientific documents with huge amounts of
numbers in them. This can only be set for a whole index, not for a
subtree.
nocjk
If this is set to true, specific East Asian (Chinese, Japanese, Korean)
character/word splitting is turned off. This will save a small
amount of CPU if you have no CJK documents. If your document base
does include such text but you are not interested in searching it,
setting nocjk may be a significant time and space saver.
cjkngramlen
This lets you adjust the size of n-grams used for indexing CJK
text. The default value of 2 is probably appropriate in most
cases. A value of 3 would allow more precision and efficiency on
longer words, but the index will be approximately twice as large.
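To see what an n-gram length of 2 or 3 means in practice, the following sketch cuts a run of characters into overlapping n-grams. It only illustrates the general technique: ngrams is a hypothetical helper, and the exact set of terms that Recoll generates for CJK text may differ.

```python
def ngrams(text, n=2):
    """Return the overlapping character n-grams of text."""
    if len(text) < n:
        return [text] if text else []
    return [text[i:i + n] for i in range(len(text) - n + 1)]


if __name__ == "__main__":
    print(ngrams("全文検索", 2))   # three bigrams
    print(ngrams("全文検索", 3))   # two trigrams
```

A longer n-gram matches fewer unrelated positions, which is where the extra precision comes from; the cost is a larger number of distinct terms in the index.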
indexstemminglanguages
A list of languages for which the stem expansion databases will be
built. See recollindex(1) or use the recollindex -l command for
possible values. You can add a stem expansion database for a
different language by using recollindex -s, but it will be deleted
during the next indexing. Only languages listed in the
configuration file are permanent.
defaultcharset
The name of the character set used for files that do not contain a
character set definition (e.g. plain text files). This can be
redefined for any sub-directory. If it is not set at all, the
character set used is the one defined by the NLS environment
(LC_ALL, LC_CTYPE, LANG), or iso8859-1 if nothing is set.
unac_except_trans
This is a list of characters, encoded in UTF-8, which should be
handled specially when converting text to unaccented lowercase.
For example, in Swedish, the letter a with diaeresis has full
alphabet citizenship and should not be turned into an a. Each
element in the space-separated list has the special character as
first element and the translation following. The handling of both
the lowercase and upper-case versions of a character should be
specified, as appartenance to the list will turn-off both standard
accent and case processing. Example for Swedish:
unac_except_trans = åå Åå ää Ää öö Öö
Note that the translation is not limited to a single character:
you could very well have something like üue in the list.
The default value set for unac_except_trans can't be listed here
because I have trouble with SGML and UTF-8, but it only contains
ligature decompositions: German ss, oe, ae, fi, fl.
This parameter can't be defined for subdirectories, it is global,
because there is no way to do otherwise when querying. If you have
document sets which would need different values, you will have to
index and query them separately.
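The element format described above (special character first,
translation following) can be illustrated with a short Python sketch.
parse_unac_except_trans is a hypothetical helper written for this
example, and the sample value mixes Swedish letters with a German
u-with-diaeresis entry.

```python
def parse_unac_except_trans(value):
    """Map each special character to its translation (hypothetical helper)."""
    table = {}
    for element in value.split():
        # First character: the one handled specially; remainder: its translation.
        table[element[0]] = element[1:]
    return table

table = parse_unac_except_trans("åå Åå ää Ää öö Öö üue")
```

Here both å and Å map to å, so the accent is kept while case is still
folded, and ü maps to the two-character string ue.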
maildefcharset
This can be used to define the default character set specifically
for email messages which don't specify it. This is mainly useful
for readpst (libpst) dumps, which are utf-8 but do not say so.
localfields
This allows setting fields for all documents under a given
directory. Typical usage would be to set an "rclaptg" field, to be
used in mimeview to select a specific viewer. If several fields
are to be set, they should be separated with a semi-colon (';')
character, which there is currently no way to escape. Also note
the initial semi-colon. Example: localfields= ;rclaptg=gnus;other
= val. You can then select a specific viewer with mimetype|tag=...
in mimeview.
testmodifusemtime
If true, use mtime instead of the default ctime to determine if a
file has been modified (in addition to size, which is always used).
Setting this can reduce re-indexing on systems where extended
attributes are modified (by some other application), but not
indexed (changing extended attributes only affects ctime). Notes:
o This may prevent detection of change in some marginal file
rename cases (the target would need to have the same size and
mtime).
o You should probably also set noxattrfields to 1 in this case,
unless you still prefer to perform xattr indexing, for
example if the local file update pattern makes it worthwhile
(in general, there is a risk that pure extended attribute
updates without file modification go undetected).
Perform a full index reset after changing the value of this
parameter.
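The difference between the two tests can be sketched in Python. This
is an illustration for POSIX systems only; file_signature and
is_modified are hypothetical names, not Recoll functions.

```python
import os

def file_signature(path, use_mtime=False):
    """Return (size, timestamp) used to decide whether a file changed (sketch)."""
    st = os.stat(path)
    # ctime changes on any inode update (permissions, extended attributes...),
    # mtime only when the file data is written.
    stamp = st.st_mtime_ns if use_mtime else st.st_ctime_ns
    return (st.st_size, stamp)

def is_modified(path, stored_signature, use_mtime=False):
    """Compare the current signature with the one stored at indexing time."""
    return file_signature(path, use_mtime) != stored_signature
```

With use_mtime=True, a metadata-only change such as chmod or an
extended attribute update leaves the signature unchanged, so the file
would not be re-indexed.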
noxattrfields
Recoll versions 1.19 and later automatically translate file
extended attributes into document fields (to be processed
according to the parameters from the fields file). Setting this
variable to 1 will disable the behaviour.
metadatacmds
This allows executing external commands for each file and storing
the output in Recoll document fields. This could be used for
example to index external tag data. The value is a list of field
names and commands. Do not forget the initial semi-colon. Example:
[/some/area/of/the/fs]
metadatacmds = ; tags = tmsu tags %f; otherfield = somecmd -xx %f
As a specially disgusting hack brought by Recoll 1.19.7, if a
"field name" begins with rclmulti, the data returned by the
command is expected to contain multiple field values, in
configuration file format. This allows setting several fields by
executing a single command. Example:
metadatacmds = ; rclmulti1 = somecmd %f
If somecmd returns data in the form of:
field1 = value1
field2 = value for field2
field1 and field2 will be set inside the document metadata.
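As an illustration of the rclmulti convention, the following Python
sketch splits such command output into separate fields. It is a
simplified stand-in (parse_rclmulti_output is a hypothetical name) and
ignores the finer points of the configuration file format, such as
continuation lines.

```python
def parse_rclmulti_output(text):
    """Split 'name = value' lines into a dict (simplified sketch)."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines, comments, and anything that is not an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        name, value = line.split("=", 1)
        fields[name.strip()] = value.strip()
    return fields

sample = "field1 = value1\nfield2 = value for field2\n"
fields = parse_rclmulti_output(sample)
```

With the sample output above, fields holds one entry for field1 and
one for field2, each with its surrounding spaces removed.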
5.4.2.3. Parameters affecting where and how we store things:
dbdir
The name of the Xapian data directory. It will be created if
needed when the index is initialized. If this is not an absolute
path, it will be interpreted relative to the configuration
directory. The value can have embedded spaces, but leading or
trailing spaces will be trimmed. You cannot use quotes here.
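The path rule described above can be sketched in Python as follows.
resolve_dbdir is a hypothetical helper and the directory names in the
usage note are made up for the example.

```python
import posixpath

def resolve_dbdir(value, confdir):
    """Apply the trimming and relative-path rules to a dbdir value (sketch)."""
    value = value.strip()          # leading/trailing spaces are dropped
    if posixpath.isabs(value):
        return value               # absolute paths are used as-is
    return posixpath.join(confdir, value)
```

For example, resolve_dbdir(" mydb ", "/home/me/.recoll") gives
"/home/me/.recoll/mydb", while "/data/my index" is returned unchanged,
embedded space included.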
idxstatusfile
The name of the scratch file where the indexer process updates its
status. Default: idxstatus.txt inside the configuration directory.
maxfsoccuppc
Maximum file system occupation before we stop indexing. The value
is a percentage, corresponding to what the "Capacity" df output
column shows. The default value is 0, meaning no checking.
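For reference, a percentage comparable to the df "Capacity" column can
be computed on Unix-like systems as sketched below. The function names
are hypothetical and the formula is the usual used / (used +
available) ratio, which is assumed here rather than taken from the
Recoll source.

```python
import os

def fs_occupation_percent(path):
    """Approximate the df 'Capacity' column for the file system holding `path`."""
    st = os.statvfs(path)
    used = st.f_blocks - st.f_bfree
    usable = used + st.f_bavail   # blocks usable by unprivileged processes
    if usable == 0:
        return 0.0
    return 100.0 * used / usable

def should_stop_indexing(path, maxfsoccuppc):
    """Return True when the occupation threshold is reached (0 disables the check)."""
    return maxfsoccuppc > 0 and fs_occupation_percent(path) >= maxfsoccuppc
```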
mboxcachedir
The directory where mbox message offsets cache files are held.
This is normally $RECOLL_CONFDIR/mboxcache, but it may be useful
to share a directory between different configurations.
mboxcacheminmbs
The minimum mbox file size over which we cache the offsets. There
is really no sense in caching offsets for small files. The default
is 5 MB.
webcachedir
This is only used by the web browser plugin indexing code, and
defines where the cache for visited pages will live. Default:
$RECOLL_CONFDIR/webcache
webcachemaxmbs
This is only used by the web browser plugin indexing code, and
defines the maximum size for the web page cache. Default: 40 MB.
Quite unfortunately, this is only taken into account when creating
the cache file. You need to delete the file for a change to be
taken into account.
idxflushmb
Threshold (megabytes of new text data) at which we flush from memory
to the disk index. Setting this can help control memory usage. A value
of 0 means no explicit flushing: Xapian then uses its own default,
which is to flush every 10000 (or XAPIAN_FLUSH_THRESHOLD) documents.
This gives little control over memory usage, which also depends on
the average document size. The default value is 10, which is probably
a bit low. If your system usually has free memory, you can try higher
values between 20 and 80. In my experience, values beyond 100 are
always counterproductive.
5.4.2.4. Parameters affecting multithread processing
The Recoll indexing process recollindex can use multiple threads to speed
up indexing on multiprocessor systems. The work done to index files is
divided into several stages, and some of the stages can be executed by
multiple threads. The stages are:
1. File system walking: this is always performed by the main thread.
2. File conversion and data extraction.
3. Text processing (splitting, stemming, etc.)
4. Xapian index update.
You can also read a longer document about the transformation of Recoll
indexing to multithreading.
The threads configuration is controlled by two configuration file
parameters.
thrQSizes
This variable defines the job input queues configuration. There
are three possible queues for stages 2, 3 and 4, and this
parameter should give the queue depth for each stage (three
integer values). If a value of -1 is used for a given stage, no
queue is used, and the thread will go on performing the next
stage. In practice, deep queues have not been shown to increase
performance. A value of 0 for the first queue tells Recoll to
perform autoconfiguration (the two other values are not needed in
this case). This is the default configuration.
thrTCounts
This defines the number of threads used for each stage. If a value
of -1 is used for one of the queue depths, the corresponding
thread count is ignored. It makes no sense to use a value other
than 1 for the last stage because updating the Xapian index is
necessarily single-threaded (and protected by a mutex).
The following example would use three queues (of depth 2), with 4 threads
for converting source documents, 2 for processing their text, and one to
update the index. This was tested to be the best configuration on the test
system (a four-processor machine with multiple disks).
thrQSizes = 2 2 2
thrTCounts = 4 2 1
The following example would use a single queue, and the complete
processing for each document would be performed by a single thread
(several documents will still be processed in parallel in most cases). The
threads will use mutual exclusion when entering the index update stage. In
practice, performance would generally be close to the previous case,
but worse in certain cases (e.g. a Zip archive would be processed purely
sequentially), so the previous approach is preferred. YMMV... The last two
values for thrTCounts are ignored.
thrQSizes = 2 -1 -1
thrTCounts = 6 1 1
The following example would disable multithreading. Indexing will be
performed by a single thread.
thrQSizes = -1 -1 -1
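As a rough model of the first configuration above (queue depths 2 2 2,
thread counts 4 2 1), the staged processing can be sketched in Python
with bounded queues. This is only an illustration of the queue/thread
layout: all names are hypothetical, the conversion and text processing
steps are trivial stand-ins, and a plain list replaces the Xapian
index. It is not how recollindex is implemented.

```python
import queue
import threading

STOP = object()  # sentinel telling a worker to exit

def run_stage(in_q, out_q, func, nthreads):
    """Start `nthreads` workers applying `func` to items read from `in_q`."""
    def worker():
        while True:
            item = in_q.get()
            if item is STOP:
                break
            out = func(item)
            if out_q is not None:
                out_q.put(out)  # blocks when the next queue is full
    threads = [threading.Thread(target=worker) for _ in range(nthreads)]
    for t in threads:
        t.start()
    return threads

def index_files(paths, depths=(2, 2, 2), counts=(4, 2, 1)):
    q_convert, q_text, q_index = (queue.Queue(maxsize=d) for d in depths)
    index = []  # stand-in for the index, updated by a single thread
    stages = [
        (run_stage(q_convert, q_text, lambda p: "text of " + p, counts[0]), q_convert),
        (run_stage(q_text, q_index, str.split, counts[1]), q_text),
        (run_stage(q_index, None, index.append, counts[2]), q_index),
    ]
    for p in paths:  # stage 1: file system walk, done by the main thread
        q_convert.put(p)
    # Stop the stages in order, so that every queued item is processed first.
    for threads, q in stages:
        for _ in threads:
            q.put(STOP)
        for t in threads:
            t.join()
    return index
```

Calling index_files(["a", "b", "c"]) returns three word lists, in an
order that depends on thread scheduling.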
5.4.2.5. Miscellaneous parameters:
autodiacsens
If the index is not stripped, decide if we automatically trigger
diacritics sensitivity when the search term has accented characters
(not in unac_except_trans). Otherwise you need to use the query
language and the D modifier to specify diacritics sensitivity.
Default is no.
autocasesens
If the index is not stripped, decide if we automatically trigger
character case sensitivity when the search term has upper-case
characters in any but the first position. Otherwise you need to use
the query language and the C modifier to specify character-case
sensitivity. Default is yes.
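The rule for autocasesens can be expressed in a few lines of Python.
This is a sketch of the rule as stated above, with a hypothetical
function name, not the actual query processing code.

```python
def triggers_case_sensitivity(term):
    """True if `term` has an upper-case character after the first position."""
    return any(ch.isupper() for ch in term[1:])
```

With this rule, a capitalized word such as "Recoll" stays
case-insensitive, while "reColl" or "RECOLL" would trigger a
case-sensitive search.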
loglevel, daemloglevel
Verbosity level for recoll and recollindex. A value of 4 lists
quite a lot of debug/information messages. 2 only lists errors.
The daem-prefixed variant is specific to the indexing monitor daemon.
logfilename, daemlogfilename
Where the messages should go. 'stderr' can be used as a special
value, and is the default. The daem-prefixed variant is specific to
the indexing monitor daemon.
checkneedretryindexscript
This defines the name for a command executed by recollindex when
starting indexing. If the exit status of the command is 0,
recollindex retries indexing all files which previously could not
be indexed because of data extraction errors. The default value is
a script which checks if any of the common bin directories have
changed (indicating that a helper program may have been
installed).
mondelaypatterns
This allows specifying wildcard path patterns (processed with
fnmatch(3) with 0 flag), to match files which change too often and
for which a delay should be observed before re-indexing. This is a
space-separated list, each entry being a pattern and a time in
seconds, separated by a colon. You can use double quotes if a path
entry contains white space. Example:
mondelaypatterns = *.log:20 "this one has spaces*:10"
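The value format can be illustrated with the following Python sketch.
The function names are hypothetical, shell-like quoting rules are
assumed for the double quotes, and Python's fnmatch module stands in
for fnmatch(3); like fnmatch(3) with a 0 flag, it lets * match across
"/" characters.

```python
import fnmatch
import shlex

def parse_mondelaypatterns(value):
    """Return a list of (pattern, delay_seconds) pairs (illustrative parser)."""
    entries = []
    for token in shlex.split(value):
        # Assumption: the delay follows the last colon of each entry.
        pattern, delay = token.rsplit(":", 1)
        entries.append((pattern, int(delay)))
    return entries

def delay_for(path, entries):
    """Return the delay of the first matching pattern, or 0 if none matches."""
    for pattern, delay in entries:
        if fnmatch.fnmatchcase(path, pattern):
            return delay
    return 0

entries = parse_mondelaypatterns('*.log:20 "this one has spaces*:10"')
```

With these entries, delay_for("/var/log/app.log", entries) yields 20,
and a path matching no pattern yields 0.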
monixinterval
Minimum interval (seconds) for processing the indexing queue. The
real time monitor does not process each event when it comes in,