<?xml version="1.0" encoding="US-ASCII"?>
<!DOCTYPE rfc SYSTEM "rfc2629.dtd" [
<!ENTITY RFC1321 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.1321.xml">
<!ENTITY RFC2119 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.2119.xml">
<!ENTITY RFC5234 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.5234.xml">
<!ENTITY RFC2629 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.2629.xml">
<!ENTITY RFC3174 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.3174.xml">
<!ENTITY RFC3552 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.3552.xml">
<!ENTITY RFC3629 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.3629.xml">
<!ENTITY RFC3986 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.3986.xml">
<!ENTITY RFC5226 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.5226.xml">
<!ENTITY RFC6234 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.6234.xml">
<!ENTITY RFC6920 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.6920.xml">
<!ENTITY RFC8174 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.8174.xml">
]>
<?xml-stylesheet type='text/xsl' href='rfc2629.xslt' ?>
<?rfc strict="yes" ?>
<?rfc inline="yes"?>
<?rfc symrefs="yes"?>
<?rfc sortrefs="yes"?>
<?rfc toc="yes"?>
<rfc number="8493"
category="info"
submissionType="independent"
consensus="yes"
ipr="trust200902">
<front>
<title abbrev="BagIt">
The BagIt File Packaging Format (V1.0)
</title>
<author initials="J." surname="Kunze" fullname="John A. Kunze">
<organization>
California Digital Library
</organization>
<address>
<postal>
<street>415 20th St, 4th Floor</street>
<city>Oakland</city>
<region>CA</region>
<code>94612</code>
<country>United States of America</country>
</postal>
<email>[email protected]</email>
</address>
</author>
<author initials="J." surname="Littman" fullname="Justin Littman">
<organization>
Stanford Libraries
</organization>
<address>
<postal>
<street>518 Memorial Way</street>
<city>Stanford</city>
<region>CA</region>
<code>94305</code>
<country>United States of America</country>
</postal>
<email>[email protected]</email>
</address>
</author>
<author initials="E." surname="Madden" fullname="Liz Madden">
<organization>
Library of Congress
</organization>
<address>
<postal>
<street>101 Independence Avenue SE</street>
<city>Washington</city>
<region>DC</region>
<code>20540</code>
<country>United States of America</country>
</postal>
<email>[email protected]</email>
</address>
</author>
<author initials="J." surname="Scancella" fullname="John Scancella">
<address>
<email>[email protected]</email>
</address>
</author>
<author initials="C." surname="Adams" fullname="Chris Adams">
<organization>
Library of Congress
</organization>
<address>
<postal>
<street>101 Independence Avenue SE</street>
<city>Washington</city>
<region>DC</region>
<code>20540</code>
<country>United States of America</country>
</postal>
<email>[email protected]</email>
</address>
</author>
<date month="October" year="2018"/>
<abstract>
<t>
This document describes BagIt, a set of hierarchical file layout conventions for
storage and transfer of arbitrary digital content. A "bag" has just enough
structure to enclose descriptive metadata "tags" and a file "payload" but
does not require knowledge of the payload's internal semantics. This
BagIt format is suitable for reliable storage and transfer.
</t>
</abstract>
</front>
<middle>
<section title="Introduction">
<section title="Purpose">
<t>
BagIt is a set of hierarchical file layout conventions designed to support
storage and transfer of arbitrary digital content.
A "bag" consists of a directory containing the payload files and other accompanying
metadata files known as "tag" files. The "tags" are metadata files intended to
facilitate and document the storage and transfer of the bag. Processing a bag
does not require any understanding of the payload file contents, and the payload
files can be accessed without processing the BagIt metadata.
</t>
<t>
The name, BagIt, is inspired by the "enclose and deposit" method
<xref target="ENCDEP"/>, sometimes referred to as "bag it and tag it".
BagIt differs from serialized archival formats such as MIME, TAR, or ZIP
in two general areas:
<list style="numbers"><t>
Strong integrity assurances. The format supports cryptographic-quality
hash algorithms (see <xref target="bag-checksum-algorithms"/>) and allows
for in-place upgrades to add additional manifests using stronger algorithms
without breaking backwards compatibility. This provides high
levels of confidence against data corruption, but it is not designed
to be secure against active attacks.
</t><t>
Direct file access. Because BagIt specifies an actual filesystem hierarchy
rather than a serialized representation of one, files can be accessed
using standard operating system utilities, implementations do not need
to process a potentially large archival file to extract a subset of data,
and the format imposes no size limits for either individual files or a bag.
</t></list>
</t>
<t>
BagIt is widely used for preserving digital assets originating from different
domains. Organizations involved in digital preservation with BagIt include
the Library of Congress, Dryad Data Repository, NSF DataONE, and the
Rockefeller Archive Center. Software implementations are available for many
languages, including Python, Ruby, Java, Perl, and PHP. BagIt is also used in
the libraries of many universities, such as Cornell, Purdue, Stanford,
Ghent University, New York University, and the University of California.
</t>
</section>
<!-- /Purpose -->
<section title="Requirements">
<t>
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL
NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED",
"MAY", and "OPTIONAL" in this document are to be interpreted as
described in BCP 14 <xref target="RFC2119"/> <xref target="RFC8174"/>
when, and only when, they appear in all capitals, as shown here.
</t>
<t>
Implementers are strongly encouraged to review the interoperability
considerations described in <xref target="sec-interoperability"/>.
</t>
</section>
<!-- /Requirements -->
<section title="Terminology">
<t>
The following terms have precise definitions as used in this document:
</t>
<t>
<list style="hanging">
<t hangText="bag:">
A set of opaque files contained within the structure
defined by this document.
</t>
<t hangText="bag declaration:">
The file required to be in all bags conforming to this document.
Contains values necessary to process the rest of a bag.
See <xref target="sec-bag-decl"/>.
</t>
<t hangText="bag checksum algorithm:">
The name of a cryptographic checksum algorithm that has been normalized
for use in a manifest or tag manifest file name (e.g., "sha512")
as described in <xref target="bag-checksum-algorithms"/>.
</t>
<t hangText="manifest:">
A tag file that maps filepaths to checksums. A manifest can be a payload
manifest (see <xref target="sec-payload-manifest"/>) or a
tag manifest (see <xref target="sec-tag-manifest"/>).
</t>
<t hangText="payload:">
The data encapsulated by the bag as a set of named files, which may be
organized in subdirectories. The contents of the payload files
are opaque to this document, and, with respect to BagIt processing,
are always considered as sequences of uninterpreted octets.
See <xref target="sec-payload-dir"/>.
</t>
<t hangText="tag directory:">
A directory that contains one or more tag files.
</t>
<t hangText="tag file:">
A file that contains metadata about the bag or its payload.
This document defines the standard
BagIt tag files:
the bag declaration in "bagit.txt" (see <xref target="sec-bag-decl"/>),
payload manifests (see <xref target="sec-payload-manifest"/>),
tag manifests (see <xref target="sec-tag-manifest"/>),
bag metadata in "bag-info.txt" (see <xref target="sec-bag-info"/>),
and remote payload in "fetch.txt" (see <xref target="sec-fetch-file"/>).
This document also allows other arbitrary tag files as described in
<xref target="sec-other-tag-files"/>.
</t>
<t hangText="complete:">
A bag that contains every element required by this document,
every payload file listed in a manifest, and any optional files that are
listed in a tag manifest. See <xref target="sec-complete-valid"/>.
</t>
<t hangText="valid:">
A complete bag where every checksum in every manifest has been
successfully verified against the corresponding file.
</t>
</list>
</t>
</section>
<!-- /Terminology -->
</section>
<!-- /Introduction -->
<section title="Structure">
<t>
A bag MUST consist of a base directory containing the following:
</t>
<t>
<list style="numbers">
<t>a set of required and optional tag files (see <xref target="sec-optional-elements"/>);</t>
<t>a subdirectory named "data", called the payload directory (see
<xref target="sec-payload-dir"/>); and</t>
<t>a set of optional tag directories.</t>
</list>
</t>
<t>
The tag files in the base directory consist of one or more files named
"manifest-<spanx style="emph">algorithm</spanx>.txt"
(see Sections <xref target="sec-payload-manifest" format="counter"/> and
<xref target="bag-checksum-algorithms" format="counter"/>),
a file named "bagit.txt" (see <xref target="sec-bag-decl"/>),
and zero or more additional tag files (see
<xref target="sec-optional-elements"/>). The tag files and directories are
in arbitrary file hierarchies and MAY have
any name that is not reserved for a file or directory in this document.
</t>
<t>
The base directory can have any name, as illustrated by the figure below.
</t>
<figure>
<artwork>
&lt;base directory&gt;/
   |
   +-- bagit.txt
   |
   +-- manifest-&lt;algorithm&gt;.txt
   |
   +-- [additional tag files]
   |
   +-- data/
   |     |
   |     +-- [payload files]
   |
   +-- [tag directories]/
         |
         +-- [tag files] </artwork>
</figure>
<section title="Required Elements" anchor="sec-required-elements">
<section title="Bag Declaration: bagit.txt" anchor="sec-bag-decl">
<t>
The "bagit.txt" tag file MUST consist of exactly two lines in this order:
</t>
<figure>
<artwork>
BagIt-Version: M.N
Tag-File-Character-Encoding: ENCODING </artwork>
<postamble>
<spanx style="emph">M.N</spanx> identifies the BagIt major (M) and minor (N) version numbers.
<spanx style="emph">ENCODING</spanx> identifies the character set encoding used by the remaining tag files.
<spanx style="emph">ENCODING</spanx> SHOULD
be <spanx style="verb">UTF-8</spanx>, but
for backwards compatibility it MAY be any
other encoding registered in <xref target="cs-registry"/>.
The bag declaration itself MUST be encoded in UTF-8 and MUST NOT contain a
Byte Order Mark (BOM) <xref target="RFC3629"/>.
</postamble>
</figure>
<t>
The number for this version of BagIt is "1.0".
</t>
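<figure>
<preamble>
The following non-normative Python sketch illustrates one way to read a
bag declaration. The function name parse_bag_declaration is hypothetical
and is not defined by this document. The sketch assumes that each label
is followed by a colon and exactly one space, which is stricter than
some earlier BagIt versions may require.
</preamble>
<artwork><![CDATA[
```python
import re

# Hypothetical helper names; the strict single-space form is an assumption.
_VERSION_RE = re.compile(r"^BagIt-Version: ([0-9]+)\.([0-9]+)$")
_ENCODING_RE = re.compile(r"^Tag-File-Character-Encoding: (\S+)$")
_UTF8_BOM = b"\xef\xbb\xbf"


def parse_bag_declaration(raw):
    """Parse the bytes of "bagit.txt" into ((major, minor), encoding)."""
    if raw.startswith(_UTF8_BOM):
        raise ValueError("bag declaration must not begin with a BOM")
    text = raw.decode("utf-8")  # the declaration itself is always UTF-8
    lines = re.split(r"\r\n|\r|\n", text)  # LF, CR, or CRLF terminators
    if lines and lines[-1] == "":
        lines.pop()  # drop the empty piece after a final terminator
    if len(lines) != 2:
        raise ValueError("bag declaration must consist of exactly two lines")
    version = _VERSION_RE.match(lines[0])
    encoding = _ENCODING_RE.match(lines[1])
    if version is None or encoding is None:
        raise ValueError("malformed bag declaration")
    return (int(version.group(1)), int(version.group(2))), encoding.group(1)
```
]]></artwork>
<postamble>
The returned encoding name would then be used to decode the remaining
tag files, while the version pair lets a caller decide which
version-specific rules to apply.
</postamble>
</figure>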
</section>
<!-- /Bag Declaration -->
<section title="Payload Directory: data/" anchor="sec-payload-dir">
<t>
The base directory MUST contain a subdirectory named "data".
</t>
<t>
The payload directory contains the arbitrary digital content within the bag.
The files under the payload directory are called payload files, or the payload.
Each payload file is treated as an opaque octet stream when verifying file
correctness.
Payload files MAY be organized in arbitrary subdirectory structures
within the payload directory; however, for the purpose of this document,
such subdirectory structures and filenames have no given meaning.
</t>
</section>
<!-- /Payload Directory -->
<section title="Payload Manifest: manifest-algorithm.txt" anchor="sec-payload-manifest">
<t>
A payload manifest file provides a complete listing of each payload file name along
with a corresponding checksum to permit data integrity checking. A bag can have more
than one payload manifest, with each using a different checksum algorithm.
Manifest entries MUST satisfy the following constraints:
</t>
<t>
<list style="symbols">
<t>
Every bag MUST contain at least one payload manifest file and MAY contain
more than one.
</t>
<t>
Every payload manifest MUST list every payload file name exactly once.
</t>
<t>
A payload manifest file MUST have a name of the form
"manifest-<spanx style="emph">algorithm</spanx>.txt", where
<spanx style="emph">algorithm</spanx>
is a string specifying the checksum algorithm used by that
manifest as described in <xref
target="bag-checksum-algorithms"/>.
</t>
</list>
</t>
<t>Example payload manifest filenames:</t>
<figure>
<artwork>
manifest-sha256.txt
manifest-sha512.txt
</artwork>
</figure>
<t>
Each line of a payload manifest file MUST be of the form
</t>
<figure>
<artwork>checksum filepath</artwork>
</figure>
<t>where <spanx style="emph">filepath</spanx> is the pathname of a file
relative to the base directory, and <spanx style="emph">checksum</spanx> is a
hex-encoded checksum calculated by applying <spanx
style="emph">algorithm</spanx> over the file.
</t>
<t>
<list style="symbols">
<t>The hex-encoded checksum MAY use uppercase and/or lowercase letters.</t>
<t>The slash character ('/') MUST be used as a path separator
in <spanx style="emph">filepath</spanx>.</t>
<t>One or more linear whitespace characters (spaces or tabs)
MUST separate <spanx style="emph">checksum</spanx> from
<spanx style="emph">filepath</spanx>.</t>
<t>There is no limitation on the length of a pathname.</t>
<t>The payload manifest MUST NOT reference files outside the payload directory.</t>
<t>
If a <spanx style="emph">filepath</spanx> includes a Line Feed
(LF), a Carriage Return (CR),
a Carriage-Return Line Feed (CRLF), or a
percent sign (%), those characters (and only those) MUST be
percent-encoded following <xref target="RFC3986"/>.
</t>
</list>
</t>
<t>
A manifest MUST NOT reference directories. Bag creators who wish to create
an otherwise empty directory have typically done so by creating an empty
placeholder file with a name such as ".keep".
</t>
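<figure>
<preamble>
As a non-normative illustration of the line format above, the following
Python sketch splits one payload manifest line and reverses the
percent-encoding of LF, CR, and the percent sign. The function names are
hypothetical. Lowercasing the checksum for later comparison is a
convenience chosen here, not a requirement of this document.
</preamble>
<artwork><![CDATA[
```python
import re

# checksum, one or more spaces or tabs, then the (still encoded) filepath
_LINE_RE = re.compile(r"^([0-9A-Fa-f]+)[ \t]+(.+)$")
_ESCAPE_RE = re.compile(r"%(0A|0D|25)", re.IGNORECASE)
_ESCAPES = {"%0A": "\n", "%0D": "\r", "%25": "%"}


def decode_filepath(encoded):
    """Decode only the three escapes the manifest format defines."""
    # A single pass avoids double decoding: "%250A" becomes "%0A".
    return _ESCAPE_RE.sub(lambda m: _ESCAPES[m.group(0).upper()], encoded)


def parse_payload_manifest_line(line):
    """Return (checksum, filepath) for one line without its terminator."""
    match = _LINE_RE.match(line)
    if match is None:
        raise ValueError("malformed manifest line: " + repr(line))
    checksum = match.group(1).lower()  # hex digits may be in either case
    filepath = decode_filepath(match.group(2))
    if not filepath.startswith("data/"):
        raise ValueError("not a payload file: " + repr(filepath))
    return checksum, filepath
```
]]></artwork>
<postamble>
The returned filepath still needs the safety checks discussed under
Security Considerations before it is used to open a file.
</postamble>
</figure>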
</section>
<!-- /Payload Manifest -->
</section>
<!-- /Required Elements -->
<section title="Optional Elements" anchor="sec-optional-elements">
<section anchor="sec-tag-manifest" title="Tag Manifest: tagmanifest-algorithm.txt">
<t>
A tag manifest is a tag file that lists other tag files and
checksums for those tag files generated using a particular bag
checksum algorithm.
</t>
<t>
A bag MAY contain one or more tag manifests, in which case each tag manifest SHOULD list the same set of tag files.
</t>
<t>
Each tag manifest MUST list every payload manifest.
Each tag manifest MUST NOT list any tag manifests
but SHOULD list the remaining tag files present in the bag.
</t>
<t>
A tag manifest file MUST have a name of the form
"tagmanifest-<spanx style="emph">algorithm</spanx>.txt",
where <spanx style="emph">algorithm</spanx> is a string following
the format described in <xref target="bag-checksum-algorithms"/>
that specifies the bag checksum algorithm used in that manifest.
</t>
<t>
Tag manifests SHOULD use the same algorithms as the payload manifests that are present in the bag.
</t>
<t>Example tag manifest filenames:</t>
<figure>
<artwork>
tagmanifest-sha256.txt
tagmanifest-sha512.txt </artwork>
</figure>
<t>
A tag manifest file has the same form as the payload manifest file
described in <xref target="sec-payload-manifest"/>
but MUST NOT list any payload files.
As a result, no <spanx style="emph">filepath</spanx> listed in a tag manifest begins "data/".
</t>
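<figure>
<preamble>
The MUST-level constraints on tag manifest contents can be checked
mechanically. The following non-normative Python sketch reports
violations for one tag manifest; the function name and the wording of
the messages are hypothetical, and the inputs are assumed to be already
decoded filepaths.
</preamble>
<artwork><![CDATA[
```python
import re

_TAG_MANIFEST_RE = re.compile(r"^tagmanifest-[a-z0-9]+\.txt$")


def tag_manifest_problems(listed_paths, payload_manifest_names):
    """Return a list of messages; an empty list means no violation found."""
    listed = set(listed_paths)
    problems = []
    for path in sorted(listed):
        if path.startswith("data/"):
            problems.append("payload file listed: " + path)
        elif _TAG_MANIFEST_RE.match(path):
            problems.append("tag manifest listed: " + path)
    # Every payload manifest present in the bag has to be listed.
    for name in sorted(payload_manifest_names):
        if name not in listed:
            problems.append("payload manifest not listed: " + name)
    return problems
```
]]></artwork>
</figure>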
</section>
<!-- /Tag Manifest -->
<section anchor="sec-bag-info" title="Bag Metadata: bag-info.txt">
<t>
The "bag-info.txt" file is a tag file that contains metadata
elements describing the bag and the payload. The metadata elements
contained in the "bag-info.txt" file are intended primarily for
human use. All metadata elements are OPTIONAL and MAY be repeated.
Because "bag-info.txt" is intended for human reading
and editing, ordering MAY be significant and the ordering of
metadata elements MUST be preserved.
</t>
<t>
A metadata element MUST consist of a label, a colon ":", a single
linear whitespace character (space or tab), and a value that is
terminated with an LF, a CR, or a CRLF.
</t>
<t>
The label MUST NOT contain a colon (:), LF, or CR.
The label MAY contain linear whitespace characters but MUST NOT start or
end with whitespace.
</t>
<t>
It is RECOMMENDED that lines not exceed 79 characters in length. Long values MAY be
continued onto the next line by inserting a LF, CR, or CRLF, and then indenting
the next line with one or more linear whitespace characters (spaces or tabs).
Except for linebreaks, such padding does not form part of the value.
</t>
<t>
Implementations wishing to support previous BagIt versions
MUST accept multiple linear whitespace characters before and after the
colon when the bag version is earlier than 1.0; such whitespace
does not form part of the label or value.
</t>
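<figure>
<preamble>
The following non-normative Python sketch shows one way to read
metadata elements in the BagIt 1.0 form while preserving their order
and any repetition. The function name parse_bag_info is hypothetical.
Two choices are assumptions made for this sketch only: blank lines are
skipped, and a continuation is joined to the preceding value with a
single space.
</preamble>
<artwork><![CDATA[
```python
import re


def parse_bag_info(text):
    """Return the metadata elements as an ordered list of (label, value)."""
    elements = []
    for line in re.split(r"\r\n|\r|\n", text):
        if line == "":
            continue  # assumption: blank lines are skipped
        if line[0] in " \t":
            # Continuation line: the indentation is padding, not value.
            if not elements:
                raise ValueError("continuation line without an element")
            label, value = elements[-1]
            elements[-1] = (label, value + " " + line.strip(" \t"))
            continue
        label, colon, rest = line.partition(":")
        if colon == "" or label == "" or label != label.strip(" \t"):
            raise ValueError("malformed label in line: " + repr(line))
        if rest[:1] not in (" ", "\t"):
            raise ValueError("colon must be followed by a space or tab")
        elements.append((label, rest[1:]))
    return elements
```
]]></artwork>
<postamble>
A list of pairs is used rather than a dictionary so that repeated
labels and the original ordering both survive a read and rewrite cycle.
</postamble>
</figure>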
<t>
The following are reserved metadata elements. The use of these reserved
metadata elements is OPTIONAL but encouraged. Reserved metadata
element names are case insensitive. Except where indicated otherwise,
these metadata element names MAY be repeated to capture multiple values.
</t>
<t>
<list style="hanging">
<t hangText="Source-Organization:">
Organization transferring the content.
</t>
<t hangText="Organization-Address:">
Mailing address of the source organization.
</t>
<t hangText="Contact-Name:">
Person at the source organization who is responsible for the content
transfer.
</t>
<t hangText="Contact-Phone:">
International format telephone number of person or position responsible.
</t>
<t hangText="Contact-Email:">
Fully qualified email address of person or position responsible.
</t>
<t hangText="External-Description:">
A brief explanation of the contents and provenance.
</t>
<t hangText="Bagging-Date:">
Date (YYYY-MM-DD) that the content was prepared for transfer.
This metadata element SHOULD NOT be repeated.
</t>
<t hangText="External-Identifier:">
A sender-supplied identifier for the bag.
</t>
<t hangText="Bag-Size:">
The size or approximate size of the bag being transferred, followed
by an abbreviation such as MB (megabytes), GB (gigabytes), or
TB (terabytes): for example,
42600 MB, 42.6 GB, or .043 TB. Compared to Payload-Oxum (described
next), Bag-Size is intended for human consumption.
This metadata element SHOULD NOT be repeated.
</t>
<t hangText="Payload-Oxum:">
The "octetstream sum" of the payload, which is intended for the
purpose of quickly detecting incomplete bags before performing checksum
validation. This is strictly an optimization, and implementations MUST perform
the standard checksum validation process before proclaiming a bag to be valid.
If present, this element MUST
be in the form "<spanx style="emph">OctetCount</spanx>.<spanx style="emph">StreamCount</spanx>",
where <spanx style="emph">OctetCount</spanx> is the total number of
octets (8-bit bytes) across all payload file content and
<spanx style="emph">StreamCount</spanx> is the total number of
payload files.
This metadata element MUST NOT be repeated.
</t>
<t hangText="Bag-Group-Identifier:">
A sender-supplied identifier for the set, if any, of bags
to which it logically belongs.
This identifier SHOULD be unique across the sender's content,
and if it is recognizable as belonging to a globally unique scheme, the receiver
SHOULD make an effort to honor the reference to it.
This metadata element SHOULD NOT be repeated.
</t>
<t hangText="Bag-Count:">
Two numbers separated by "of", in particular, "N of T",
where T is the total number of bags in a group of bags and N is the
ordinal number within the group. If T is not known, specify it as "?"
(question mark): for example, 1 of 2, 4 of 4, 3 of ?, 89 of 145.
This metadata element SHOULD NOT be repeated.
If this metadata element is present, it is RECOMMENDED to also
include the Bag-Group-Identifier element.
</t>
<t hangText="Internal-Sender-Identifier:">
An alternate sender-specific identifier for the content
and/or bag.
</t>
<t hangText="Internal-Sender-Description:">
A sender-local explanation of the contents and provenance.
</t>
</list>
</t>
<t>
In addition to these metadata elements, other arbitrary metadata
elements MAY also be present.
</t>
<figure>
<preamble>An example of a "bag-info.txt" file is as follows:</preamble>
<artwork>
Source-Organization: FOO University
Organization-Address: 1 Main St., Cupertino, California, 11111
Contact-Name: Jane Doe
Contact-Phone: +1 111-111-1111
Contact-Email: [email protected]
External-Description: Uncompressed greyscale TIFF images from the
     FOO papers colle...
Bagging-Date: 2008-01-15
External-Identifier: university_foo_001
Payload-Oxum: 279164409832.1198
Bag-Group-Identifier: university_foo
Bag-Count: 1 of 15
Internal-Sender-Identifier: /storage/images/foo
Internal-Sender-Description: Uncompressed greyscale TIFFs created
     from microfilm and are... </artwork>
</figure>
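<figure>
<preamble>
The Payload-Oxum element described in this section can be computed
without reading any file content. The following non-normative Python
sketch walks the payload directory and builds the
"OctetCount.StreamCount" string; the function names are hypothetical.
</preamble>
<artwork><![CDATA[
```python
import os


def payload_oxum(bag_dir):
    """Return "OctetCount.StreamCount" for the files under data/."""
    octets = 0
    streams = 0
    for root, _dirs, files in os.walk(os.path.join(bag_dir, "data")):
        for name in files:
            octets += os.path.getsize(os.path.join(root, name))
            streams += 1
    return "{}.{}".format(octets, streams)


def oxum_matches(bag_dir, declared):
    """Quick incompleteness check only; it never proves a bag valid."""
    return payload_oxum(bag_dir) == declared.strip()
```
]]></artwork>
<postamble>
A mismatch indicates an incomplete bag early, but a match is only a
hint: full checksum validation is still needed before a bag is
declared valid.
</postamble>
</figure>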
</section>
<section title="Fetch File: fetch.txt" anchor="sec-fetch-file">
<t>
For reasons of efficiency, a bag MAY be sent with a list of files to be
fetched and added to the payload before it can meaningfully be checked
for completeness.
The fetch file allows a bag to be transmitted with
"holes" in it, which can be practical for several reasons. For example,
it obviates the need for the sender to stage a large serialized copy of
the content while the bag is transferred to the receiver. Also, this
method allows a sender to construct a bag from components that are either
a subset of logically related components (e.g., the localized logical
object could be much larger than what is intended for export) or
assembled from logically distributed sources (e.g., the object components
for export are not stored locally under one filesystem tree).
An OPTIONAL tag file, called the fetch file, contains such a list.
</t>
<t>
The fetch file MUST be named "fetch.txt". Every file listed in
the fetch file MUST be listed in every
payload manifest. A fetch file MUST NOT list any tag files.
</t>
<t>
Each line of a fetch file MUST be of the form
</t>
<figure>
<artwork>url length filepath</artwork>
<postamble>
where <spanx style="emph">url</spanx> identifies the file to be
fetched and MUST be an absolute URI as defined in
<xref target="RFC3986"/>, <spanx style="emph">length</spanx> is
the number of octets in the file (or "-", to leave it unspecified),
and <spanx style="emph">filepath</spanx> identifies the
corresponding payload file, relative to the base directory.
</postamble>
</figure>
<t>
The slash character ('/') MUST be used as a path separator in
<spanx style="emph">filepath</spanx>. One or more linear whitespace
characters (spaces or tabs) MUST separate these
three values, and any such characters in the <spanx style="emph">url</spanx>
MUST be percent-encoded <xref target="RFC3986"/>.
If <spanx style="emph">filepath</spanx> includes an LF, a CR,
a CRLF, or a percent sign (%), those characters (and only those) MUST be
percent-encoded as described in <xref target="RFC3986"/>.
There is no
limitation on the length of any of the fields in the fetch file.
</t>
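<figure>
<preamble>
The following non-normative Python sketch parses one fetch file line
into its three values. The function name parse_fetch_line is
hypothetical. Treating any URI with a scheme as absolute, and requiring
the filepath to begin with "data/", are simplifying assumptions of this
sketch.
</preamble>
<artwork><![CDATA[
```python
import re
from urllib.parse import urlsplit

# url, whitespace, length (digits or "-"), whitespace, encoded filepath
_FETCH_RE = re.compile(r"^(\S+)[ \t]+(-|[0-9]+)[ \t]+(.+)$")
_ESCAPE_RE = re.compile(r"%(0A|0D|25)", re.IGNORECASE)
_ESCAPES = {"%0A": "\n", "%0D": "\r", "%25": "%"}


def parse_fetch_line(line):
    """Return (url, length, filepath); length is None when given as "-"."""
    match = _FETCH_RE.match(line)
    if match is None:
        raise ValueError("malformed fetch line: " + repr(line))
    url, length, encoded = match.groups()
    if urlsplit(url).scheme == "":
        raise ValueError("url must be an absolute URI: " + url)
    filepath = _ESCAPE_RE.sub(
        lambda m: _ESCAPES[m.group(0).upper()], encoded)
    if not filepath.startswith("data/"):
        raise ValueError("fetch entries name payload files: " + filepath)
    return url, (None if length == "-" else int(length)), filepath
```
]]></artwork>
<postamble>
When a length is given, a receiver could compare it with the number of
octets actually retrieved before running checksum validation.
</postamble>
</figure>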
</section>
<!-- /Fetch File -->
<section title="Other Tag Files" anchor="sec-other-tag-files">
<t>
A bag MAY contain other tag files that are not defined by this
document.
Implementations MUST perform standard checksum validation on any such tag file
that is listed in a tag manifest but MUST otherwise ignore its contents.
</t>
</section>
<!-- /Other Tag Files -->
</section>
<!-- /Optional Elements -->
<section title="Text Tag File Format" anchor="sec-tag-files">
<t>
All tag files specifically described in this document MUST adhere to
the text tag file format described below. Other tag files MAY adhere to
the text tag file format described below.
</t>
<t>
Text tag files are line oriented, and each line MUST be terminated
by an LF, a CR, or a CRLF. It is RECOMMENDED that the last line in a tag
file also end with LF, CR, or CRLF.
Text tag file names MUST end in the extension ".txt".
</t>
<t>
In all text tag files except for the bag declaration file, text MUST use
the character encoding specified in the "bagit.txt" bag declaration
file. Text tag files except for the bag declaration file MAY include a
Byte Order Mark (BOM) only if the specified encoding requires it for
proper decoding. In accordance with <xref target="RFC3629"/>, when "bagit.txt"
specifies UTF-8, the tag files MUST NOT begin with a BOM.
See <xref target="sec-bag-decl"/>.
</t>
<t>
The use of UTF-8 for text tag files is strongly RECOMMENDED. A future version
of BagIt may disallow encodings other than UTF-8.
</t>
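<figure>
<preamble>
As a non-normative illustration of the rules in this section, the
following Python sketch decodes a text tag file using the encoding named
in the bag declaration and splits it on LF, CR, or CRLF. The function
name read_tag_lines is hypothetical, and the sketch assumes the declared
encoding name is one that the Python codec registry recognizes.
</preamble>
<artwork><![CDATA[
```python
import codecs
import re

_UTF8_BOM = b"\xef\xbb\xbf"


def read_tag_lines(raw, declared_encoding):
    """Decode tag file bytes and split them on LF, CR, or CRLF."""
    # codecs.lookup normalizes spellings such as "UTF-8" and "utf8".
    is_utf8 = codecs.lookup(declared_encoding).name == "utf-8"
    if is_utf8 and raw.startswith(_UTF8_BOM):
        raise ValueError("UTF-8 tag files must not begin with a BOM")
    text = raw.decode(declared_encoding)
    lines = re.split(r"\r\n|\r|\n", text)
    if lines and lines[-1] == "":
        lines.pop()  # the last line ended with a terminator
    return lines
```
]]></artwork>
<postamble>
An explicit pattern is used instead of str.splitlines because the
latter also splits on characters, such as form feed, that this
document does not treat as line terminators.
</postamble>
</figure>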
</section>
<!-- /Tags Files -->
<section title="Bag Checksum Algorithms" anchor="bag-checksum-algorithms">
<t>
Payload manifests and tag manifests permit validating the integrity of the payload
and tag files in a bag by means of checksums produced by checksum algorithms.
Checksum values MUST be encoded so as to conform to the manifest format
specified in <xref target="sec-payload-manifest"/>. However, the internal details
of a checksum are outside the scope of this document.
</t>
<t>
To avoid future ambiguity, the checksum algorithm SHOULD be registered
in IANA's "Named Information Hash Algorithm Registry" <xref target="ni-registry" />
according to <xref target="RFC6920"/> but MAY, for backwards compatibility, also be
MD5 <xref target="RFC1321"/> or SHA-1 <xref target="RFC3174"/>.
</t>
<t>
The name of the checksum algorithm MUST be normalized for use in the
manifest's filename by lowercasing the common name of the algorithm and
removing all non-alphanumeric characters. Following is a partial list
that maps common algorithm names to normalized names:
<list style="symbols">
<t>MD5: md5</t>
<t>SHA-1: sha1</t>
<t>SHA-256: sha256</t>
<t>SHA-512: sha512</t>
</list></t>
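<figure>
<preamble>
The normalization rule above amounts to lowercasing the common name and
dropping every character that is not a letter or digit. A non-normative
Python sketch, with hypothetical function names and assuming ASCII
algorithm names, follows.
</preamble>
<artwork><![CDATA[
```python
import re


def normalize_algorithm_name(common_name):
    """Lowercase the name and remove all non-alphanumeric characters."""
    return re.sub(r"[^a-z0-9]", "", common_name.lower())


def manifest_filename(common_name, tag=False):
    """Build the payload or tag manifest file name for an algorithm."""
    prefix = "tagmanifest" if tag else "manifest"
    return "{}-{}.txt".format(prefix, normalize_algorithm_name(common_name))
```
]]></artwork>
</figure>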
<t>
Starting with BagIt 1.0, bag creation and validation tools MUST support the
SHA-256 and SHA-512 algorithms <xref target="RFC6234"/> and SHOULD enable
SHA-512 by default when creating new bags.
For backwards compatibility, implementers SHOULD support
MD5 <xref target="RFC1321"/> and SHA-1 <xref target="RFC3174"/>.
Implementers are encouraged to simplify the process of adding additional
manifests using new algorithms to streamline the process of in-place
upgrades.
</t>
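<figure>
<preamble>
An in-place upgrade of the kind encouraged above could look like the
following non-normative Python sketch, which adds a payload manifest
for a further algorithm to an existing bag. The function names are
hypothetical. The sketch assumes that the normalized algorithm name is
also accepted by Python's hashlib.new, which holds for "sha256" and
"sha512", and it chooses LF terminators and two spaces as separator.
</preamble>
<artwork><![CDATA[
```python
import hashlib
import os


def file_checksum(path, algorithm, chunk_size=1 << 20):
    """Hash a file in chunks so that large payload files fit in memory."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as stream:
        for chunk in iter(lambda: stream.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def add_payload_manifest(bag_dir, algorithm="sha512"):
    """Write manifest-ALGORITHM.txt covering every file under data/."""
    entries = []
    for root, _dirs, files in os.walk(os.path.join(bag_dir, "data")):
        for name in files:
            full = os.path.join(root, name)
            rel = os.path.relpath(full, bag_dir).replace(os.sep, "/")
            # Percent-encode only what the manifest format requires,
            # starting with "%" so later escapes are not re-encoded.
            rel = (rel.replace("%", "%25")
                      .replace("\n", "%0A")
                      .replace("\r", "%0D"))
            entries.append((rel, file_checksum(full, algorithm)))
    manifest = os.path.join(bag_dir, "manifest-" + algorithm + ".txt")
    with open(manifest, "w", encoding="utf-8", newline="") as out:
        for rel, checksum in sorted(entries):
            out.write(checksum + "  " + rel + "\n")
    return manifest
```
]]></artwork>
<postamble>
Because each tag manifest lists every payload manifest, any tag
manifests in the bag would need to be regenerated after such an
upgrade.
</postamble>
</figure>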
</section>
<!-- /Bag Checksum Algorithms -->
</section>
<!-- /Bag Structure -->
<section title="Complete and Valid Bags" anchor="sec-complete-valid">
<t>
A <spanx style="emph">complete</spanx> bag MUST meet the following
requirements:
</t>
<t>
<list style="numbers">
<t>Every required element MUST be present (see <xref target="sec-required-elements"/>).</t>
<t>Every file listed in every tag manifest MUST be present.</t>
<t>Every file listed in every payload manifest MUST be present.</t>
<t>For BagIt 1.0, every payload file MUST be listed in every payload manifest.
Note that older versions of BagIt allowed payload files to be
listed in just one of the manifests.
</t>
<t>Every element present MUST conform to BagIt 1.0.</t>
</list>
</t>
<t>
A <spanx style="emph">valid</spanx> bag MUST meet the following requirements:
</t>
<t>
<list style="numbers">
<t>The bag MUST be <spanx style="emph">complete</spanx>.</t>
<t>
Every checksum in every payload manifest and tag manifest has been
successfully verified against the contents of the corresponding file.
</t>
</list>
</t>
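<figure>
<preamble>
The payload-related requirements above can be sketched as one pass over
a single payload manifest. The following Python code is non-normative
and its function names are hypothetical. It assumes the manifest has
already been parsed into a mapping from decoded filepath to expected
checksum, that those filepaths have passed the safety checks discussed
under Security Considerations, and that the algorithm name is accepted
by hashlib.new.
</preamble>
<artwork><![CDATA[
```python
import hashlib
import os


def _checksum(path, algorithm):
    digest = hashlib.new(algorithm)
    with open(path, "rb") as stream:
        for chunk in iter(lambda: stream.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_payload(bag_dir, entries, algorithm):
    """Return (missing, unlisted, mismatched) for one payload manifest.

    entries maps each decoded filepath to its expected hex checksum.
    """
    present = set()
    for root, _dirs, files in os.walk(os.path.join(bag_dir, "data")):
        for name in files:
            full = os.path.join(root, name)
            present.add(os.path.relpath(full, bag_dir).replace(os.sep, "/"))
    listed = set(entries)
    missing = sorted(listed - present)    # listed but not in the bag
    unlisted = sorted(present - listed)   # in the bag but not listed
    mismatched = sorted(
        path for path in present & listed
        if _checksum(os.path.join(bag_dir, *path.split("/")), algorithm)
        != entries[path].lower()          # hex case is not significant
    )
    return missing, unlisted, mismatched
```
]]></artwork>
<postamble>
With respect to this one manifest, empty "missing" and "unlisted" lists
correspond to the payload part of completeness, and an additionally
empty "mismatched" list corresponds to validity. The required elements
and any tag manifests still need to be checked separately.
</postamble>
</figure>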
</section>
<!-- Completeness and validity -->
<section title="Examples">
<section title="Example of a Basic Bag">
<t>
This is the layout of a basic bag containing an image and a companion
Optical Character Recognition (OCR) file. Lines of file content are shown with added parentheses to
indicate each complete line.
For brevity, this example uses MD5 rather than the recommended SHA-512.
</t>
<figure>
<artwork>
myfirstbag/
|
|   manifest-md5.txt
|     (49afbd86a1ca9f34b677a3f09655eae9 data/27613-h/images/q172.png)
|     (408ad21d50cef31da4df6d9ed81b01a7 data/27613-h/images/q172.txt)
|
|   bagit.txt
|     (BagIt-Version: 1.0 )
|     (Tag-File-Character-Encoding: UTF-8 )
|
\--- data/
      |
      |   27613-h/images/q172.png
      |     (... image bytes ... )
      |
      |   27613-h/images/q172.txt
      |     (... OCR text ... )
      .... </artwork>
</figure>
</section>
<section title="Example Bag Using fetch.txt">
<t>
This is the layout of a bag that expects the receiver to download the
files listed in the payload manifests prior to validation. Lines of
file content are shown with added parentheses to indicate each
complete line.
For brevity, this example uses MD5 rather than the recommended SHA-512.
</t>
<figure>
<artwork>
highsmith-tahoe/
|
| manifest-md5.txt
| (102b0e6effe208ef9b29864946de9e22 data/23364a.tif )
|
| fetch.txt
| (https://cdn.loc.gov/master/pnp/highsm/23300/23364a.tif
| 216951362 data/23364a.tif )
|
| bagit.txt
| (BagIt-version: 1.0 )
| (Tag-File-Character-Encoding: UTF-8 )
|
| bag-info.txt
| (Internal-Sender-Description: Download link found at )
| ( https://www.loc.gov/resource/highsm.23364/ )</artwork>
</figure>
</section>
</section>
<!-- /Examples -->
<section title="Security Considerations" anchor="sec-security">
<section title="Special Directory Characters">
<t>
This specification does not prohibit paths in payload manifests, tag
manifests, and fetch files from containing characters that have
special meaning on some operating systems. Implementers MUST ensure
that files outside the bag directory structure are not accessed when
reading or writing files based on paths specified in a bag.
</t>
<t>
All implementations SHOULD have a test suite to guard against
special directory characters.
</t>
<t>
For example, a maliciously crafted "tagmanifest-sha512.txt" file might
contain entries that begin with a path character such as "/", "..",
or a "~username" home directory reference in an attempt to cause a
naive implementation to leak or overwrite targeted files on a POSIX
operating system.
</t>
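<t>
As a non-normative illustration, a lexical check along the following lines
could reject the POSIX path forms mentioned above before any file is opened.
The function name "is_safe_bag_path" is hypothetical. The sketch assumes
"/"-separated paths and does not resolve symbolic links, so it is only one
layer of defense and not a complete safeguard.
</t>
<figure>
<artwork><![CDATA[
```python
import posixpath


def is_safe_bag_path(rel_path):
    """Return True if a manifest path stays inside the bag directory.

    This is a purely lexical check: it does not touch the filesystem.
    """
    # Reject empty paths, absolute paths, and "~user" style references.
    if not rel_path or rel_path.startswith(("/", "~")):
        return False
    # Collapse "." and ".." segments, then see whether the result escapes.
    normalized = posixpath.normpath(rel_path)
    return normalized != ".." and not normalized.startswith("../")
```
]]></artwork>
</figure>
<t>
With this sketch, "data/27613-h/images/q172.png" is accepted, while
"/etc/passwd", "../outside.txt", and "data/../../outside.txt" are rejected.
</t>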
<t>
Implementations on Windows SHOULD be tested to ensure
that safety checks prevent the use of drive letters and the less commonly used
namespace sequences (e.g., "\\?\C:\...") described in <xref target="MSFNAM"/>.
</t>
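<t>
A comparable non-normative sketch for Windows path syntax is shown below. It
relies on the Python standard library's "pathlib.PureWindowsPath" to recognize
drive letters, UNC shares, and namespace prefixes. The function name
"is_safe_windows_path" is hypothetical, and the policy of rejecting every ".."
segment is a deliberately conservative assumption. Reserved device names are
not covered.
</t>
<figure>
<artwork><![CDATA[
```python
from pathlib import PureWindowsPath


def is_safe_windows_path(rel_path):
    """Return True if the path has no drive, no root, and no '..' segment."""
    path = PureWindowsPath(rel_path)
    # 'drive' is non-empty for "C:", UNC shares, and "\\?\" prefixes;
    # 'root' is non-empty for paths that start with a separator.
    if path.drive or path.root:
        return False
    # PureWindowsPath keeps ".." segments, so they can be rejected here.
    return ".." not in path.parts
```
]]></artwork>
</figure>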
<t>
To assist implementers, the Library of Congress conformance suite
<xref target="LC-CONFORMANCE-SUITE" /> includes tests for invalid bags
that are expected to fail on POSIX or Windows clients.
</t>
</section>
<section title="Control of URLs in fetch.txt">
<t>
Implementers of tools that complete bags by retrieving URLs listed in
a fetch file need to be aware that some of those URLs might point,
intentionally or unintentionally, to hosts that are not under the control
of the bag's sender. Moreover, older checksum algorithms, even if
reasonable for detecting corruption during transit, may not offer strong
cryptographic protection against intentional spoofing.
</t>
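<t>
One possible mitigation, sketched below in a non-normative way, is to compare
each URL in a fetch file against a locally configured set of trusted hosts
before retrieving anything. The function name "untrusted_fetch_entries", the
allowlist itself, and the choice to flag every scheme other than "https" are
assumptions made for illustration and are not requirements of this
specification. The sketch assumes that each fetch line consists of a URL, a
length, and a filename separated by whitespace.
</t>
<figure>
<artwork><![CDATA[
```python
from urllib.parse import urlsplit


def untrusted_fetch_entries(fetch_text, trusted_hosts):
    """Return (url, filename) pairs whose URL falls outside local policy."""
    flagged = []
    for line in fetch_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        url, _length, filename = line.split(None, 2)
        parts = urlsplit(url)
        host = (parts.hostname or "").lower()
        # Assumed policy: only HTTPS URLs on explicitly trusted hosts pass.
        if parts.scheme != "https" or host not in trusted_hosts:
            flagged.append((url, filename))
    return flagged
```
]]></artwork>
</figure>
<t>
A tool could report the flagged entries to the user or refuse to fetch them,
depending on local policy.
</t>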
</section>
<section title="File Sizes in fetch.txt">
<t>
The size of files, as optionally reported in the fetch file,
cannot be guaranteed to match the actual file size to be downloaded.
Implementers SHOULD take steps to monitor and abort transfer when the
received file size exceeds the file size reported in the fetch file.
Implementers SHOULD NOT use the file size in the
fetch file for critical resource allocation, such as buffer
sizing or storage requisitioning.
</t>
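<t>
The following non-normative sketch shows one way to abort a transfer once more
data has arrived than the fetch file declared. The names "copy_with_limit" and
"SizeExceededError" are hypothetical. The sketch assumes that "source" and
"destination" are binary file-like objects and that "declared_size" is None
when the fetch file gives no length. Data is written incrementally, so no
buffer is sized from the declared length.
</t>
<figure>
<artwork><![CDATA[
```python
class SizeExceededError(Exception):
    """Raised when more bytes arrive than the fetch file declared."""


def copy_with_limit(source, destination, declared_size, chunk_size=65536):
    """Copy source to destination, aborting if declared_size is exceeded.

    Returns the number of bytes received when the transfer completes.
    """
    received = 0
    while True:
        chunk = source.read(chunk_size)
        if not chunk:
            return received
        received += len(chunk)
        if declared_size is not None and received > declared_size:
            # Stop before writing the chunk that crosses the limit.
            raise SizeExceededError(
                "received %d bytes, expected at most %d"
                % (received, declared_size)
            )
        destination.write(chunk)
```
]]></artwork>
</figure>
<t>
A caller that catches the exception would normally also discard the partially
written file.
</t>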
</section>
<section title="Attacks on Payload File Content">
<t>
The integrity assurance provided by manifests is designed to provide
high levels of confidence against data corruption but is not designed
to be secure against active attacks. Organizations that need to
secure bags against such threats SHOULD agree on additional
measures, such as digital signatures, that are out
of scope for this specification.
</t>
</section>
<!-- End Section: Attacks on Payload File Content -->
</section>
<!-- End Section: Security considerations -->
<section title="Practical Considerations (Non-normative)">
<section title="Interoperability" anchor="sec-interoperability">
<t>
This section lists practical considerations for implementers and
users. None of the points below are required, but they are recommended
for general-purpose usage.
</t>
<t>
Upon discovering errors in bags, an implementation is free to take action
(for example, logging or reporting) in an application-specific manner.
This document does not mandate any particular action.
</t>
<t>
The Library of Congress conformance suite <xref target="LC-CONFORMANCE-SUITE" />
is provided as a public resource to test new implementations for compatibility and
error handling.
</t>
<section title="Filename Normalization" anchor="filename-normalization">
<t>
This section provides background information on various challenges caused by
differences in how operating systems, filesystems, and common tools handle
filenames. This section is followed by a list of recommendations for implementers in
<xref target="filename-normalization-recommendations"/>.
</t>
<section title="Case Sensitivity">
<t>
There are three challenges for interoperability related to filename case:
<list style="symbols"><t>
Filesystems such as File Allocation Table (FAT) or Extended File
Allocation Table (EXFAT) always convert filenames to uppercase:
"example.txt" will be stored as "EXAMPLE.TXT".
</t><t>
Many Unix filesystems save filenames exactly as provided, which allows
multiple files that differ only in case: "example.txt" and
"Example.txt" are separate files.
</t><t>
New Technology File System (NTFS) and Apple's Hierarchical File System
(HFS) Plus usually preserve case when storing files but are
case insensitive when retrieving them. A file saved as "Example.txt"
will be retrieved by that name but will also be retrieved as
"EXAMPLE.TXT", "example.txt", etc.
</t></list>
</t>
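<t>
As a non-normative illustration, paths that would collide on a
case-insensitive filesystem can be detected by grouping them under a
case-folded key. The function name "case_collisions" is hypothetical, and
Python's "str.casefold" is only an approximation of the case-mapping rules
that any particular filesystem applies.
</t>
<figure>
<artwork><![CDATA[
```python
from collections import defaultdict


def case_collisions(paths):
    """Return sorted groups of paths that differ only in case."""
    groups = defaultdict(list)
    for path in paths:
        # Paths that fold to the same key would collide on a
        # case-insensitive filesystem.
        groups[path.casefold()].append(path)
    return [sorted(names) for names in groups.values() if len(names) > 1]
```
]]></artwork>
</figure>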
</section>
<section title="Unicode Normalization">
<t>
The Unicode specification has common cases where different character sequences
produce the same human-meaningful text.
These sequences are referred to as "canonically equivalent", and the Unicode
specification defines several normalization forms (see <xref
target="UNICODE-TR15"/> for the full details).</t>
<figure>
<preamble>
The example below shows the common surname "Nunez", written with an acute accent on the "u"
and a tilde on the second "n", in two different normalization forms.
</preamble>
<artwork><![CDATA[
Normalization Form D (Decomposition)
Char UTF8 Hex Name
----------------------------------------------
N 4e LATIN CAPITAL LETTER N
u 75 LATIN SMALL LETTER U
\u0301 cc81 COMBINING ACUTE ACCENT
n 6e LATIN SMALL LETTER N
\u0303 cc83 COMBINING TILDE
e 65 LATIN SMALL LETTER E
z 7a LATIN SMALL LETTER Z
Normalization Form C (Canonical Composition)
Char UTF8 Hex Name
----------------------------------------------
N 4e LATIN CAPITAL LETTER N
u c3ba LATIN SMALL LETTER U WITH ACUTE
n c3b1 LATIN SMALL LETTER N WITH TILDE
e 65 LATIN SMALL LETTER E
z 7a LATIN SMALL LETTER Z ]]></artwork>
</figure>
<t>
Unicode normalization is relevant to BagIt implementers because different
systems have different standards for normalization:
<list style="symbols"><t>
Apple's HFS Plus filesystem always normalizes filenames to a
fully decomposed form based on the Unicode 2.0 specification (see <xref target="TN1150"/>).
</t><t>
Windows treats filenames as opaque character sequences (see <xref target="MSFNAM"/>) and will store and return the encoded bytes exactly
as provided.
</t><t>
Linux and other common Unix systems are generally similar to Windows in
storing and returning opaque byte streams, but this behavior is
technically dependent on the filesystem.
</t><t>
Utilities used for file management, transfer, and archiving may ignore this
issue, apply an arbitrary normalization form, or allow the user to control
how normalization is applied.
</t></list>
</t>
<t>
In practice, this means that the encoded filename stored in a manifest may
fail a simple file existence check because the filename's normalization was
changed at some point after the manifest was written. This situation is very
confusing for users because the filenames are visually indistinguishable, and
the "missing" file is obviously present in the payload directory.
</t>
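<t>
The effect described above can be reproduced, and worked around, with the
following non-normative sketch, which compares manifest names and filesystem
names only after applying the same normalization form to both. The function
name "missing_after_normalization" is hypothetical, and the choice of NFC as
the default form is an assumption; any single form applied to both sides
serves the same purpose.
</t>
<figure>
<artwork><![CDATA[
```python
import unicodedata


def missing_after_normalization(manifest_names, filesystem_names, form="NFC"):
    """Return manifest names with no match on disk after normalization."""
    on_disk = {unicodedata.normalize(form, name) for name in filesystem_names}
    return [
        name
        for name in manifest_names
        if unicodedata.normalize(form, name) not in on_disk
    ]


# The manifest stores the composed form; the filesystem returned the
# decomposed form of the same visible filename.
composed = "N\u00fa\u00f1ez.txt"
decomposed = unicodedata.normalize("NFD", composed)
naive_match = composed in [decomposed]  # False: the strings differ
tolerant = missing_after_normalization([composed], [decomposed])  # empty
```
]]></artwork>
</figure>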
</section>
<section title="Recommendations" anchor="filename-normalization-recommendations">
<t>
<list style="symbols">
<t>
Implementations SHOULD discourage the creation of bags containing
files that differ only in case.
</t>
<t>
Implementations SHOULD prevent the creation of bags containing files
that differ only in normalization form.
</t>
<t>
BagIt implementations SHOULD tolerate differences in normalization
form by comparing the list of filesystem names with the list of manifest
names after applying the same normalization form to both.
</t>
<t>
Implementations SHOULD issue a warning when multiple manifests are
present that differ only in case or normalization form.
</t>