Item Resource

The Item resource, also called a Processed Item, describes an item in the APTrust work queue, and includes the fields described below. Please also note the section at the bottom of this page describing how APTrust handles duplicate items.

- id - (int) The item's unique ID.
- created_at - (datetime) When this record was created.
- updated_at - (datetime) When this record was last updated.
- name - (string) The name of the tar file that was originally uploaded into the receiving bucket by the member.
- etag - (string) The etag of the originally-uploaded tar file, as calculated by Amazon's S3 service. For tar files under 5GB that were uploaded in a single transaction, this will be the file's md5 digest. For files uploaded via multipart upload, and for all files over 5GB in size, this will look like an md5 checksum, followed by a dash, followed by the number of chunks that comprised the upload, and it will not match the file's md5 checksum. AWS provides some documentation on how etags for multipart uploads are calculated.
- bucket - (string) The name of the receiving bucket into which the original tar file was uploaded.
- user - (string - nullable) The email address of the user who put the request into the queue. This will appear only on requests for restoration, deletion and copying to DPN.
- institution - (string) The domain name of the institution that owns the bag.
- note - (string) Contains human-readable information about the outcome of the work request. This is generally used to record error messages.
- action - (string) The type of action requested. It will be one of the following, though this list may grow as APTrust adds services:
  - Ingest - A new tar file has appeared in your institution's receiving bucket and needs to be ingested. (See Duplicate Items below for some special considerations.)
  - Fixity Check - APTrust has internally scheduled a fixity check on our stored copy of your intellectual object. Currently, partners cannot request a fixity check. It's an internal, scheduled operation.
  - Restore - Someone at your institution has requested that an object be restored. Restorations can be slow for very large bags, because APTrust must reassemble all of the bag's files, package them, and send them to your restoration bucket.
  - Delete - Someone at your institution has requested that an Intellectual Object, or files within that object, be deleted.
  - DPN - Someone at your institution has requested that an Intellectual Object be copied to DPN.
- stage - (string) The stage of processing the item is in. It will be one of the following:
  - Requested - The action has been requested, but processing has not yet started.
  - Receive - A tar file has been received in your institution's receiving bucket, but the ingest process has not yet started. Note that APTrust scans the receiving buckets on a scheduled basis, which may be every hour in times of light ingest and every several hours in times of heavy ingest. So there will always be some lag time between your uploading a tar file and its appearance in the Items queue.
  - Fetch - APTrust is fetching the tar file from your receiving bucket to a local staging area for validation.
  - Unpack - APTrust is untarring the tar file before validating its contents. On very large bags, this can take an hour or longer.
  - Validate - APTrust is validating the contents of the bag. This includes calculating md5 and sha256 checksums on all the files in the bag, which may take hours for very large bags. Processing ends here for invalid bags. A bag can be invalid if required tags are missing, if tag or manifest files are missing, if the contents of the data directory don't match what's in the manifest, or if file checksums don't match what's in the manifest.
  - Store - APTrust is copying the bag's data files to long-term storage.
  - Record - APTrust is recording information about the bag in Fedora. Information includes the bag name, title, description and rights; a list of files, their checksums, and where the files are stored; and PREMIS events describing actions taken on the bag and its files.
  - Cleanup - APTrust is deleting the bag and its contents from the local staging area and from your receiving bucket. Bags are deleted from the receiving bucket only after successful ingest.
  - Resolve - The action has been completed or cancelled, and no further action will be taken.
- status - (string) The status of the item within its current stage. It will be one of the following:
  - Pending - The item has not yet entered the stage specified in the stage property. It's waiting for an available worker to pick it up.
  - Started - The item is currently being processed in whatever stage is set in the stage property. For actions other than Ingest, the combination of stage=Requested and status=Started means the operation is underway.
  - Success - The action was completed successfully.
  - Failed - The action failed.
  - Cancelled - The action was cancelled.
- outcome - (string - nullable) This is usually a copy of whatever is in the status field. This field is obsolete and may be removed in a future release.
- bag_date - (datetime) The date and time the bag was created at your institution.
- date - (datetime) The date and time the bag was received in your institution's S3 receiving bucket.
- retry - (boolean) A flag indicating whether APTrust should try again to process this item. When processing fails due to transient errors, such as network outages, low disk space, etc., APTrust pushes items back into the queue with this flag set to true. This flag will be set to false if processing failed because of a fatal error, such as a bag failing validation.
- reviewed - (boolean) This flag indicates that someone at your institution has reviewed an item in the queue. You can flag items as reviewed in the APTrust web UI, and then later ask to see only those items you have not yet reviewed.
- object_identifier - (string - nullable) If an object has been successfully ingested, this field will contain the APTrust object identifier, in the format institution.edu/bag_name. That's the institution's domain name, followed by a slash, followed by the bag name.
- generic_file_identifier - (string - nullable) This field is set only when the action pertains to an individual file, as is the case for fixity checks and file deletions.
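
As a rough illustration of how the stage, status, retry and note fields fit together, here is a minimal Python sketch that turns an Item record into a one-line summary. It assumes the item has already been loaded into a plain dictionary keyed by the field names above; the `summarize` function and the sample values are hypothetical and are not part of any APTrust client.

```python
def summarize(item: dict) -> str:
    """Describe where a work item stands, using the documented
    stage, status, retry and note fields. (Hypothetical helper.)"""
    stage = item["stage"]
    status = item["status"]
    if status == "Pending":
        # Pending: the item has not yet entered the named stage.
        return f"waiting for a worker to pick up stage {stage}"
    if status == "Started":
        return f"in progress at stage {stage}"
    if status == "Failed":
        if item.get("retry"):
            # retry=true signals a transient error; APTrust will try again.
            return f"failed at stage {stage}; queued for retry"
        # retry=false signals a fatal error; the note usually says why.
        note = item.get("note") or "no note recorded"
        return f"failed at stage {stage}; will not be retried ({note})"
    if status in ("Success", "Cancelled"):
        return f"{status.lower()} at stage {stage}"
    # The documented lists may grow, so unknown values are reported, not rejected.
    return f"unrecognized status {status!r} at stage {stage}"


# Sample record with made-up values.
example = {
    "stage": "Validate",
    "status": "Failed",
    "retry": False,
    "note": "missing manifest",
}
print(summarize(example))
```

Unknown status values fall through to a generic message rather than raising an error, since the lists of actions and stages may grow over time.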
Duplicate Items

You may occasionally see duplicate items in the Items queue. This happens when you upload a tar file to the receiving bucket, then later upload another tar file with the same name. S3 puts a timestamp on each file describing when the file was last copied into the bucket. APTrust stores that timestamp in the date field of the Item resource.
If you upload a tar file with the same name more than once, and the etag does not change, only one copy of that tar file will ever be ingested, because we can assume that the contents of the bag have not changed. (Remember that the S3 etag is the md5 checksum of the tar file, if the tar file is under 5GB in size and was not a multipart upload.)
If the etag has changed, APTrust will re-ingest the new bag, because we have to assume that the contents of the bag have changed. (This isn't always true, though we have to assume it is. Tar files sent to S3 via multi-part upload can have a different etag every time they're uploaded, even if the contents don't change.)
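
To see why a multipart etag can change while the contents stay the same, the following Python sketch reproduces the commonly observed S3 scheme: a single-part etag is the md5 of the whole file, while a multipart etag is the md5 of the concatenated binary md5 digests of the parts, followed by a dash and the part count. This is an illustration, not a contract guaranteed by this page. The function names and the tiny part sizes are assumptions chosen for readability, and the data is held in memory for brevity; a real tar file would be read in chunks.

```python
import hashlib


def single_part_etag(data: bytes) -> str:
    """Etag for a single-transaction upload: the plain md5 hex digest."""
    return hashlib.md5(data).hexdigest()


def multipart_etag(data: bytes, part_size: int) -> str:
    """Etag as commonly observed for multipart uploads: md5 of the
    concatenated per-part md5 digests, then '-' and the part count."""
    part_digests = [
        hashlib.md5(data[i:i + part_size]).digest()
        for i in range(0, len(data), part_size)
    ]
    combined = hashlib.md5(b"".join(part_digests)).hexdigest()
    return f"{combined}-{len(part_digests)}"


def looks_multipart(etag: str) -> bool:
    """True if the etag has the 'checksum-dash-count' shape."""
    return "-" in etag.strip('"')


data = b"x" * 12
# Same bytes, different part sizes: the two etags differ.
print(multipart_etag(data, 5))  # three parts
print(multipart_etag(data, 7))  # two parts
print(single_part_etag(data))   # plain md5 of the whole payload
```

Because the result depends on where the part boundaries fall, uploading identical bytes with a different part size yields a different etag, which is why APTrust cannot treat a changed multipart etag as proof that the contents are unchanged.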
During times of heavy ingest, APTrust's processing backlog may run to 48 hours or more. If you upload a bag three times in that 48-hour period, with changes each time, you'll see three entries in the Items queue for that bag, with three different etags. When the system gets around to processing the bag, it will process only the version of the bag with the latest date stamp. The two earlier versions of the bag will never be ingested, and may appear in the Receive/Pending state for some time.
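
The selection rule described above can be modeled in a few lines of Python. This is a sketch of the documented behavior, not APTrust's actual code. It assumes each item is a dictionary with the name, bucket, etag and date fields, and that date is an ISO 8601 string with an explicit UTC offset; the function name and the sample records are hypothetical.

```python
from datetime import datetime


def versions_to_process(items: list) -> list:
    """For each (bucket, name) pair, keep only the item with the latest
    date stamp. Earlier uploads of the same tar file are skipped."""
    latest = {}
    for item in items:
        key = (item["bucket"], item["name"])
        when = datetime.fromisoformat(item["date"])
        if key not in latest or when > latest[key][0]:
            latest[key] = (when, item)
    return [item for _, item in latest.values()]


# Three uploads of the same bag within the backlog window (made-up values).
queue = [
    {"name": "bag.tar", "bucket": "receiving", "etag": "aaa", "date": "2016-03-01T10:00:00+00:00"},
    {"name": "bag.tar", "bucket": "receiving", "etag": "bbb", "date": "2016-03-01T18:00:00+00:00"},
    {"name": "bag.tar", "bucket": "receiving", "etag": "ccc", "date": "2016-03-02T09:00:00+00:00"},
]
for item in versions_to_process(queue):
    print(item["name"], item["etag"], item["date"])
```

In this model the two earlier entries are simply never selected, which matches their lingering in the Receive/Pending state in the queue.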
Updated on Mar 31, 2016 by Andrew Diamond (Version 8)