inspirehep.modules.workflows.tasks package¶
Submodules¶
inspirehep.modules.workflows.tasks.actions module¶
Tasks related to user actions.
-
inspirehep.modules.workflows.tasks.actions.
add_core
(*args, **kwargs)[source]¶ Mark a record as CORE if it was approved as CORE.
-
inspirehep.modules.workflows.tasks.actions.
count_reference_coreness
(*args, **kwargs)[source]¶ Count number of core/non-core matched references.
-
inspirehep.modules.workflows.tasks.actions.
error_workflow
(message)[source]¶ Force an error in the workflow with the given message.
-
inspirehep.modules.workflows.tasks.actions.
fix_submission_number
(*args, **kwargs)[source]¶ Ensure that the submission number contains the workflow object id.
Unlike form submissions, records coming from HEPCrawl can’t know yet which workflow object they will create, so they use the crawler job id as their submission number. We would instead like the submission number to hold the id of the workflow object they came from, so that, given a record, we can link to its original Holding Pen entry.
Parameters: - obj – a workflow object.
- eng – a workflow engine.
Returns: None
-
inspirehep.modules.workflows.tasks.actions.
halt_record
(action=None, message=None)[source]¶ Halt the workflow for approval with optional action.
-
inspirehep.modules.workflows.tasks.actions.
in_production_mode
(*args, **kwargs)[source]¶ Check if we are in production mode.
-
inspirehep.modules.workflows.tasks.actions.
is_arxiv_paper
(*args, **kwargs)[source]¶ Check if a workflow contains a paper from arXiv.
Parameters: - obj – a workflow object.
- eng – a workflow engine.
Returns: whether the workflow contains a paper from arXiv.
Return type:
-
inspirehep.modules.workflows.tasks.actions.
is_experimental_paper
(*args, **kwargs)[source]¶ Check if a workflow contains an experimental paper.
Parameters: - obj – a workflow object.
- eng – a workflow engine.
Returns: whether the workflow contains an experimental paper.
Return type:
-
inspirehep.modules.workflows.tasks.actions.
is_marked
(key)[source]¶ Check if the workflow object has a specific mark.
-
inspirehep.modules.workflows.tasks.actions.
is_record_accepted
(*args, **kwargs)[source]¶ Check if the record was approved.
-
inspirehep.modules.workflows.tasks.actions.
is_record_relevant
(*args, **kwargs)[source]¶ Shall we halt this workflow for potential acceptance or just reject?
-
inspirehep.modules.workflows.tasks.actions.
is_submission
(*args, **kwargs)[source]¶ Check if a workflow contains a submission.
Parameters: - obj – a workflow object.
- eng – a workflow engine.
Returns: whether the workflow contains a submission.
Return type:
-
inspirehep.modules.workflows.tasks.actions.
jlab_ticket_needed
(*args, **kwargs)[source]¶ Check if a JLab curation ticket is needed.
-
inspirehep.modules.workflows.tasks.actions.
load_from_source_data
(*args, **kwargs)[source]¶ Restore the workflow data and extra_data from source_data.
-
inspirehep.modules.workflows.tasks.actions.
mark
(key, value)[source]¶ Mark the workflow object by putting a value in a key in extra_data.
Note
Important. Committing a change to the database before saving the current workflow object will wipe away any content in extra_data not saved previously.
Parameters: - key – the key used to mark the workflow
- value – the value assigned to the key
Returns: the decorator to decorate a workflow object
Return type: func
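The pattern described above, where mark takes a key and a value and returns a callable that is later applied to a workflow object, can be sketched as follows. This is a minimal illustration and not the INSPIRE implementation: FakeWorkflowObject is a hypothetical stand-in for a real workflow object, the inner (obj, eng) signature is assumed from the task parameters documented on this page, and treating a mark as "set" when its value is truthy is also an assumption.

```python
class FakeWorkflowObject(object):
    """Hypothetical stand-in exposing only the ``extra_data`` dict."""

    def __init__(self):
        self.extra_data = {}


def mark(key, value):
    """Return a task that stores ``value`` under ``key`` in ``extra_data``."""
    def _mark(obj, eng):
        obj.extra_data[key] = value

    return _mark


def is_marked(key):
    """Return a task reporting whether ``key`` holds a truthy value (assumed)."""
    def _is_marked(obj, eng):
        return bool(obj.extra_data.get(key))

    return _is_marked


obj = FakeWorkflowObject()
mark('approved', True)(obj, None)  # no engine is needed in this sketch
print(is_marked('approved')(obj, None))
print(is_marked('rejected')(obj, None))
```

Because the mark lives only in extra_data, it is subject to the note above: it persists only once the workflow object itself has been saved.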
-
inspirehep.modules.workflows.tasks.actions.
normalize_journal_titles
(*args, **kwargs)[source]¶ Normalize the journal titles.
Normalize the journal titles stored in the journal_title field of each object contained in publication_info.
Note
The DB is queried in order to get the $ref of each journal and add it in journal_record.
Todo
Refactor: it must be checked that normalize_journal_title is appropriate.
Parameters: - obj – a workflow object.
- eng – a workflow engine.
Returns: None
-
inspirehep.modules.workflows.tasks.actions.
populate_journal_coverage
(*args, **kwargs)[source]¶ Populate journal_coverage from the Journals DB.
Checks in the Journals DB whether the current article was published in a journal that we harvest entirely, then populates the journal_coverage key in extra_data with 'full' if it was, 'partial' otherwise.
Parameters: - obj – a workflow object.
- eng – a workflow engine.
Returns: None
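The 'full' versus 'partial' decision can be sketched as below. The Journals DB is replaced by a hypothetical in-memory set, the journal titles are placeholders, and reading the title from the journal_title key of each publication_info entry is an assumption based on the field names used elsewhere on this page.

```python
# Hypothetical stand-in for the Journals DB: placeholder titles of journals
# that are assumed to be harvested entirely.
FULLY_HARVESTED_JOURNALS = {'J.Example.Phys.'}


def journal_coverage(publication_info):
    """Return 'full' if any entry names a fully harvested journal, else 'partial'."""
    for entry in publication_info:
        if entry.get('journal_title') in FULLY_HARVESTED_JOURNALS:
            return 'full'
    return 'partial'


data = {'publication_info': [{'journal_title': 'J.Example.Phys.'}]}
extra_data = {}
extra_data['journal_coverage'] = journal_coverage(data.get('publication_info', []))
print(extra_data)
```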
-
inspirehep.modules.workflows.tasks.actions.
preserve_root
(*args, **kwargs)[source]¶ Save the current workflow payload to be used as root for the merger.
Parameters: - obj – a workflow object.
- eng – a workflow engine.
Returns: None
-
inspirehep.modules.workflows.tasks.actions.
refextract
(*args, **kwargs)[source]¶ Extract references from various sources and add them to the workflow.
Runs refextract on both the PDF attached to the workflow and the references provided by the submitter, if any, then chooses the source that generated the most references and attaches them to the workflow object.
Parameters: - obj – a workflow object.
- eng – a workflow engine.
Returns: None
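The selection step, keeping whichever source produced more references, can be illustrated with a small helper. The name choose_references is hypothetical, and letting the PDF result win ties is an assumption; the description above does not say how ties are resolved.

```python
def choose_references(pdf_references, text_references):
    """Return whichever list holds more references (PDF wins ties, by assumption)."""
    if len(text_references) > len(pdf_references):
        return text_references
    return pdf_references


from_pdf = [{'reference': {'misc': ['A']}}]
from_text = [{'reference': {'misc': ['A']}}, {'reference': {'misc': ['B']}}]
chosen = choose_references(from_pdf, from_text)
print(len(chosen))
```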
-
inspirehep.modules.workflows.tasks.actions.
reject_record
(message)[source]¶ Reject record with message.
-
inspirehep.modules.workflows.tasks.actions.
save_workflow
(*args, **kwargs)[source]¶ Save the current workflow.
Saves the changes applied to the given workflow object in the database.
Note
The save function only indexes the current workflow. For this reason, we need to call db.session.commit().
Todo
Refactor: move this logic inside WorkflowObject.save().
Parameters: - obj – a workflow object.
- eng – a workflow engine.
Returns: None
-
inspirehep.modules.workflows.tasks.actions.
set_refereed_and_fix_document_type
(*args, **kwargs)[source]¶ Set the refereed field using the Journals DB.
Checks in the Journals DB whether the current article was published in journals that we know for sure to be peer-reviewed, or that publish both peer-reviewed and non-peer-reviewed content but for which we can infer that it belongs to the former category, and sets the refereed key in data to True if that was the case. If instead we know for sure that all journals in which it was published are not peer-reviewed, we set it to False.
Also replaces the article document type with conference paper if the paper was only published in non-refereed proceedings.
Parameters: - obj – a workflow object.
- eng – a workflow engine.
Returns: None
inspirehep.modules.workflows.tasks.arxiv module¶
Tasks used in OAI harvesting for arXiv record manipulation.
Extract authors from any author XML found in the arXiv archive.
Parameters: - obj – Workflow Object to process
- eng – Workflow Engine processing the object
-
inspirehep.modules.workflows.tasks.arxiv.
arxiv_derive_inspire_categories
(*args, **kwargs)[source]¶ Derive inspire_categories from the arXiv categories.
Uses side effects to populate the inspire_categories key in obj.data by converting its arXiv categories.
Parameters: - obj (WorkflowObject) – a workflow object.
- eng (WorkflowEngine) – a workflow engine.
Returns: None
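A rough sketch of such a side-effecting conversion is shown below. The two-entry mapping is an illustrative excerpt only, the layout of arxiv_eprints and of the resulting inspire_categories entries is assumed, and derive_inspire_categories is a hypothetical helper operating on a plain dict rather than on a workflow object.

```python
# Illustrative excerpt only; the complete mapping is maintained by INSPIRE.
ARXIV_TO_INSPIRE = {
    'hep-ph': 'Phenomenology-HEP',
    'hep-th': 'Theory-HEP',
}


def derive_inspire_categories(data):
    """Populate ``data['inspire_categories']`` in place; return nothing."""
    terms = []
    for eprint in data.get('arxiv_eprints', []):
        for category in eprint.get('categories', []):
            term = ARXIV_TO_INSPIRE.get(category)
            if term and term not in terms:  # skip unknown and repeated terms
                terms.append(term)
    data['inspire_categories'] = [
        {'source': 'arxiv', 'term': term} for term in terms
    ]


data = {'arxiv_eprints': [{'categories': ['hep-ph', 'hep-th', 'hep-ph']}]}
result = derive_inspire_categories(data)
print(data['inspire_categories'])
```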
-
inspirehep.modules.workflows.tasks.arxiv.
arxiv_package_download
(*args, **kwargs)[source]¶ Perform the package download step for arXiv records.
Parameters: - obj – Workflow Object to process
- eng – Workflow Engine processing the object
inspirehep.modules.workflows.tasks.beard module¶
Set of workflow tasks for beard API.
-
inspirehep.modules.workflows.tasks.beard.
get_beard_url
()[source]¶ Return the BEARD URL endpoint, if any.
inspirehep.modules.workflows.tasks.classifier module¶
Set of tasks for classification.
-
inspirehep.modules.workflows.tasks.classifier.
classify_paper
(taxonomy=None, rebuild_cache=False, no_cache=False, output_limit=20, spires=False, match_mode='full', with_author_keywords=False, extract_acronyms=False, only_core_tags=False, fast_mode=False)[source]¶ Extract keywords from a PDF file or metadata in an OAI harvest.
inspirehep.modules.workflows.tasks.magpie module¶
Set of workflow tasks for MagPie API.
-
inspirehep.modules.workflows.tasks.magpie.
filter_magpie_response
(labels, limit)[source]¶ Filter response from Magpie API, keeping most relevant labels.
-
inspirehep.modules.workflows.tasks.magpie.
get_magpie_url
()[source]¶ Return the Magpie URL endpoint, if any.
-
inspirehep.modules.workflows.tasks.magpie.
guess_categories
(*args, **kwargs)[source]¶ Workflow task to ask Magpie API for a subject area assessment.
-
inspirehep.modules.workflows.tasks.magpie.
guess_experiments
(*args, **kwargs)[source]¶ Workflow task to ask Magpie API for a subject area assessment.
inspirehep.modules.workflows.tasks.manual_merging module¶
Tasks related to manual merging.
-
inspirehep.modules.workflows.tasks.manual_merging.
halt_for_merge_approval
(*args, **kwargs)[source]¶ Wait for curator approval.
Pauses the workflow using the merge_approval action, which is resolved whenever the curator says that the conflicts have been solved.
Parameters: - obj – a workflow object.
- eng – a workflow engine.
Returns: None
-
inspirehep.modules.workflows.tasks.manual_merging.
merge_records
(*args, **kwargs)[source]¶ Perform a manual merge.
Merges two records stored in the workflow object as the content of the head and update keys, and stores the result in obj.data. Also stores any resulting conflicts in obj.extra_data['conflicts'].
Because this is a manual merge we assume that the two records have no common ancestor, so root is the empty dictionary.
Parameters: - obj – a workflow object.
- eng – a workflow engine.
Returns: None
-
inspirehep.modules.workflows.tasks.manual_merging.
save_roots
(*args, **kwargs)[source]¶ Save and update the head roots and delete the update roots from the db.
If both head and update have a root from a given source, then the older one is removed and the newer one is assigned to the head. Otherwise, the update roots from sources that are missing among the head roots are assigned to the head; that is, it is a union-like operation.
Parameters: - obj – a workflow object.
- eng – a workflow engine.
Returns: None
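The union-like rule above can be sketched on plain dictionaries. Here each set of roots is modelled as a mapping from source name to an (updated, payload) tuple, which is an assumption made for illustration; the real task works on roots stored in the db, and merge_roots is a hypothetical name.

```python
def merge_roots(head_roots, update_roots):
    """Return the roots the head should end up with, one per source.

    Both arguments map a source name to an ``(updated, payload)`` tuple,
    where ``updated`` is any comparable timestamp.
    """
    merged = dict(head_roots)
    for source, (updated, payload) in update_roots.items():
        # Take the update root if the head lacks this source,
        # or if the update root is the newer of the two.
        if source not in merged or updated > merged[source][0]:
            merged[source] = (updated, payload)
    return merged


head = {'arxiv': (1, {'title': 'old'}), 'submitter': (5, {'title': 'form'})}
update = {'arxiv': (3, {'title': 'new'}), 'publisher': (2, {'title': 'pub'})}
merged = merge_roots(head, update)
print(sorted(merged))
```

In this example the newer arxiv root replaces the older one, the submitter root is kept, and the publisher root is added because the head had none from that source.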
-
inspirehep.modules.workflows.tasks.manual_merging.
store_records
(*args, **kwargs)[source]¶ Store the records involved in the manual merge.
Performs the following steps:
- Updates the head so that it contains the result of the merge.
- Marks the update as merged with the head and deletes it.
- Populates the deleted_records and new_record keys in, respectively, head and update so that they contain a JSON reference to each other.
Todo
The last step should be performed by the merge method itself.
Parameters: - obj – a workflow object.
- eng – a workflow engine.
Returns: None
inspirehep.modules.workflows.tasks.matching module¶
Tasks to check if the incoming record already exists.
-
inspirehep.modules.workflows.tasks.matching.
auto_approve
(obj, eng)[source]¶ Check whether to auto-approve the currently ingested article.
Parameters: - obj – a workflow object.
- eng – a workflow engine.
Returns: True when the record belongs to an arXiv category that is fully harvested or if the primary category is physics.data-an, otherwise False.
Return type:
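The rule in the Returns entry above can be sketched as follows. The set of fully harvested categories is a hypothetical configuration value, the arxiv_eprints layout is assumed, and treating the first listed category as the primary one is also an assumption. The helper names are illustrative and operate on a plain record dict.

```python
# Hypothetical configuration: categories assumed to be fully harvested.
FULLY_HARVESTED_CATEGORIES = {'hep-ph', 'hep-th'}


def arxiv_categories(record):
    """Return the arXiv categories of ``record`` in the order they are listed."""
    return [
        category
        for eprint in record.get('arxiv_eprints', [])
        for category in eprint.get('categories', [])
    ]


def in_fully_harvested_category(record):
    """Return True if any category of ``record`` is fully harvested."""
    return any(c in FULLY_HARVESTED_CATEGORIES for c in arxiv_categories(record))


def should_auto_approve(record):
    """Return True for fully harvested categories or a physics.data-an primary."""
    categories = arxiv_categories(record)
    primary_is_data_an = bool(categories) and categories[0] == 'physics.data-an'
    return in_fully_harvested_category(record) or primary_is_data_an


print(should_auto_approve({'arxiv_eprints': [{'categories': ['hep-th']}]}))
print(should_auto_approve({'arxiv_eprints': [{'categories': ['math.GT']}]}))
```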
-
inspirehep.modules.workflows.tasks.matching.
delete_self_and_stop_processing
(*args, **kwargs)[source]¶ Delete both versions of itself and stop the workflow.
-
inspirehep.modules.workflows.tasks.matching.
exact_match
(*args, **kwargs)[source]¶ Return True if the record is already present in the system.
Uses the default configuration of the inspire-matcher to find duplicates of the current workflow object in the system.
Also sets the matches.exact property in extra_data to the list of control numbers that matched.
Parameters: - obj – a workflow object.
- eng – a workflow engine.
Returns: True if the workflow object has a duplicate in the system, False otherwise.
Return type:
-
inspirehep.modules.workflows.tasks.matching.
fuzzy_match
(*args, **kwargs)[source]¶ Return True if a similar record is found in the system.
Uses a custom configuration for inspire-matcher to find records similar to the current workflow object’s payload in the system.
Also sets the matches.fuzzy property in extra_data to a list of briefs of the first 5 records that matched.
Parameters: - obj – a workflow object.
- eng – a workflow engine.
Returns: True if the workflow object has a duplicate in the system, False otherwise.
Return type:
-
inspirehep.modules.workflows.tasks.matching.
has_fully_harvested_category
(record)[source]¶ Check if the record in obj.data has fully harvested categories.
Parameters: record (dict) – the ingested article. Returns: True when the record belongs to an arXiv category that is fully harvested, otherwise False. Return type: bool
-
inspirehep.modules.workflows.tasks.matching.
has_more_than_one_exact_match
(*args, **kwargs)[source]¶ Does the record have more than one exact match.
-
inspirehep.modules.workflows.tasks.matching.
has_same_source
(extra_data_key)[source]¶ Match a workflow in obj.extra_data[extra_data_key] by the source.
Takes a list of workflows from extra_data, using extra_data_key as the key, and goes through them checking if at least one workflow has the same source as the current workflow object.
Parameters: extra_data_key – the key to retrieve a workflow list from the current workflow object.
Returns: True if a workflow, whose id is in obj.extra_data[extra_data_key], matches the current workflow by the source.
Return type:
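A minimal sketch of this check is given below. The WORKFLOWS dictionary is a hypothetical stand-in for looking up workflow objects by id, FakeWorkflowObject is likewise hypothetical, and reading the source from acquisition_source.source with a case-insensitive comparison is an assumption.

```python
# Hypothetical in-memory replacement for fetching workflows by id.
WORKFLOWS = {
    1: {'acquisition_source': {'source': 'arXiv'}},
    2: {'acquisition_source': {'source': 'submitter'}},
}


class FakeWorkflowObject(object):
    """Hypothetical stand-in exposing ``data`` and ``extra_data``."""

    def __init__(self, data, extra_data):
        self.data = data
        self.extra_data = extra_data


def get_source(data):
    """Return the lower-cased acquisition source, or '' if missing (assumed layout)."""
    return data.get('acquisition_source', {}).get('source', '').lower()


def has_same_source(extra_data_key):
    """Return a task checking the matched workflows for the current source."""
    def _has_same_source(obj, eng):
        current = get_source(obj.data)
        matched_ids = obj.extra_data.get(extra_data_key, [])
        return any(get_source(WORKFLOWS[wf_id]) == current for wf_id in matched_ids)

    return _has_same_source


obj = FakeWorkflowObject(
    {'acquisition_source': {'source': 'arxiv'}},
    {'holdingpen_matches': [1, 2]},
)
print(has_same_source('holdingpen_matches')(obj, None))
```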
-
inspirehep.modules.workflows.tasks.matching.
is_fuzzy_match_approved
(*args, **kwargs)[source]¶ Check if a fuzzy match has been approved by a human.
-
inspirehep.modules.workflows.tasks.matching.
match_non_completed_wf_in_holdingpen
(obj, eng)[source]¶ Return True if a matching wf is processing in the HoldingPen.
Uses a custom configuration of the inspire-matcher to find duplicates of the current workflow object in the Holding Pen not in the COMPLETED state.
Also sets holdingpen_matches in extra_data to the list of ids that matched.
Parameters: - obj – a workflow object.
- eng – a workflow engine.
Returns: True if the workflow object has a duplicate in the Holding Pen that is not COMPLETED, False otherwise.
Return type:
-
inspirehep.modules.workflows.tasks.matching.
match_previously_rejected_wf_in_holdingpen
(obj, eng)[source]¶ Return True if it matches a COMPLETED and rejected wf in the HoldingPen.
Uses a custom configuration of the inspire-matcher to find duplicates of the current workflow object in the Holding Pen in the COMPLETED state, marked as approved = False.
Also sets holdingpen_matches in extra_data to the list of ids that matched.
Parameters: - obj – a workflow object.
- eng – a workflow engine.
Returns: True if the workflow object has a duplicate in the Holding Pen that is COMPLETED and was rejected, False otherwise.
Return type:
-
inspirehep.modules.workflows.tasks.matching.
pending_in_holding_pen
(*args, **kwargs)[source]¶ Return the list of matching workflows in the holdingpen.
Matches the holdingpen records by their arxiv_eprint, their doi, and by a custom validator function.
Parameters: - obj – a workflow object.
- validation_func – a function used to filter the matched records.
Returns: the ids matching the current obj that satisfy validation_func.
Return type: (list)
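The matching rule, same arxiv_eprint or same doi and accepted by validation_func, can be sketched against an in-memory list. HOLDING_PEN, its flat entry layout, the identifiers in it, and the find_pending helper are all hypothetical; the real task queries the Holding Pen rather than a Python list.

```python
# Hypothetical Holding Pen contents with placeholder identifiers.
HOLDING_PEN = [
    {'id': 10, 'status': 'HALTED', 'arxiv_eprint': '0000.00000', 'doi': None},
    {'id': 11, 'status': 'COMPLETED', 'arxiv_eprint': '0000.00000', 'doi': None},
    {'id': 12, 'status': 'HALTED', 'arxiv_eprint': None, 'doi': '10.0000/example'},
]


def _same(left, right):
    """Return True only when both identifiers are present and equal."""
    return left is not None and left == right


def find_pending(record, validation_func):
    """Return ids of entries sharing an identifier that pass ``validation_func``."""
    matches = []
    for entry in HOLDING_PEN:
        same_eprint = _same(record.get('arxiv_eprint'), entry['arxiv_eprint'])
        same_doi = _same(record.get('doi'), entry['doi'])
        if (same_eprint or same_doi) and validation_func(entry):
            matches.append(entry['id'])
    return matches


def not_completed(entry):
    return entry['status'] != 'COMPLETED'


print(find_pending({'arxiv_eprint': '0000.00000'}, not_completed))
```

Passing a different validation_func is what would distinguish checks such as "not COMPLETED" from "COMPLETED and rejected" in this sketch.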
-
inspirehep.modules.workflows.tasks.matching.
raise_if_match_wf_in_error_or_initial
(obj, eng)[source]¶ Raise if a matching wf is in ERROR or INITIAL state in the HoldingPen.
Uses a custom configuration of the inspire-matcher to find duplicates of the current workflow object in the Holding Pen that are in the ERROR or INITIAL state.
If any match is found, it sets error_workflows_matched in extra_data to the list of ids that matched and raises an error.
Parameters: - obj – a workflow object.
- eng – a workflow engine.
Returns: None
-
inspirehep.modules.workflows.tasks.matching.
set_core_in_extra_data
(*args, **kwargs)[source]¶ Set core=True in obj.extra_data if the record belongs to a core arXiv category.
-
inspirehep.modules.workflows.tasks.matching.
set_exact_match_as_approved_in_extradata
(*args, **kwargs)[source]¶ Set the best match in matches.approved in extra_data.
-
inspirehep.modules.workflows.tasks.matching.
set_fuzzy_match_approved_in_extradata
(*args, **kwargs)[source]¶ Set the human approved match in matches.approved in extra_data.
-
inspirehep.modules.workflows.tasks.matching.
stop_matched_holdingpen_wfs
(obj, eng)[source]¶ Stop the matched workflow objects in the holdingpen.
Stops the matched workflows in the holdingpen by replacing their steps with a new one defined on the fly, containing a stop step, and executing it. For traceability reasons, these workflows are also marked with 'stopped-by-wf', whose value is the current workflow’s id.
When an article is harvested twice, this function is used to stop the first workflow and let the current one be processed, since it has the latest metadata.
Parameters: - obj – a workflow object.
- eng – a workflow engine.
Returns: None
inspirehep.modules.workflows.tasks.merging module¶
Tasks related to record merging.
-
inspirehep.modules.workflows.tasks.merging.
has_conflicts
(*args, **kwargs)[source]¶ Return whether the workflow has any conflicts.
-
inspirehep.modules.workflows.tasks.merging.
merge_articles
(*args, **kwargs)[source]¶ Merge two articles.
The workflow payload is overwritten by the merged record, and the conflicts are stored in extra_data.conflicts. It also adds a callback_url, which contains the endpoint that resolves the merge conflicts.
Note
When the feature flag FEATURE_FLAG_ENABLE_MERGER is False, it will skip the merge.
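How a feature flag can gate the merge is sketched below. Only the flag name comes from the note above; the CONFIG dictionary, the merge_if_enabled helper and especially toy_merge are hypothetical, and toy_merge is a deliberately naive placeholder rather than the merger INSPIRE uses.

```python
# Hypothetical configuration holding the flag named in the note above.
CONFIG = {'FEATURE_FLAG_ENABLE_MERGER': False}


def toy_merge(head, update):
    """Naive placeholder merge: update wins, differing keys become conflicts."""
    merged = dict(head)
    merged.update(update)
    conflicts = sorted(k for k in head if k in update and head[k] != update[k])
    return merged, conflicts


def merge_if_enabled(payload, update, extra_data, config=CONFIG):
    """Return the merged payload, or the untouched payload if the flag is off."""
    if not config.get('FEATURE_FLAG_ENABLE_MERGER'):
        return payload  # merge skipped: nothing is written to extra_data
    merged, conflicts = toy_merge(payload, update)
    extra_data['conflicts'] = conflicts
    return merged


extra_data = {}
payload = {'title': 'A', 'pages': 10}
update = {'title': 'B'}
print(merge_if_enabled(payload, update, extra_data))
print(merge_if_enabled(payload, update, extra_data, {'FEATURE_FLAG_ENABLE_MERGER': True}))
```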
inspirehep.modules.workflows.tasks.refextract module¶
Workflow tasks using refextract API.
-
inspirehep.modules.workflows.tasks.refextract.
extract_journal_info
(*args, **kwargs)[source]¶ Extract the journal information from pubinfo_freetext.
Runs extract_journal_reference on the pubinfo_freetext key of each publication_info, if it exists, and uses the extracted information to populate the other keys.
Parameters: - obj – a workflow object.
- eng – a workflow engine.
Returns: None
-
inspirehep.modules.workflows.tasks.refextract.
extract_references_from_pdf
(*args, **kwargs)[source]¶ Extract references from PDF and return in INSPIRE format.
-
inspirehep.modules.workflows.tasks.refextract.
extract_references_from_raw_ref
(reference, custom_kbs_file=None)[source]¶ Extract references from raw references in reference element.
Parameters: - reference (dict) – a schema-compliant element of the references field. If it already contains a structured reference (that is, a reference key), no further processing is done. Otherwise, the contents of the raw_refs key are extracted by refextract.
- custom_kbs_file (dict) – configuration for refextract knowledge bases.
Returns: a list of schema-compliant elements of the references field, with all previously unextracted references extracted.
Return type: List[dict]
Note
This function returns a list of references because one raw reference might correspond to several references.
-
inspirehep.modules.workflows.tasks.refextract.
extract_references_from_raw_refs
(*args, **kwargs)[source]¶ Extract references from raw references in reference list.
Parameters: - references (List[dict]) – a schema-compliant references field. If an element already contains a structured reference (that is, a reference key), it is not modified. Otherwise, the contents of the raw_refs key are extracted by refextract.
- custom_kbs_file (dict) – configuration for refextract knowledge bases.
Returns: a schema-compliant references field, with all previously unextracted references extracted.
Return type: List[dict]
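The control flow shared by the two functions above, leaving structured elements alone and expanding raw ones, can be sketched as follows. The fake_refextract function is a hypothetical stand-in that merely splits text on semicolons; it does not call refextract, and the shape of the raw_refs entries is an assumption.

```python
def fake_refextract(text):
    """Hypothetical extractor: one raw string may yield several references."""
    return [
        {'reference': {'misc': [part.strip()]}}
        for part in text.split(';')
        if part.strip()
    ]


def extract_from_raw_ref(reference):
    """Return a list of structured references for a single element."""
    if 'reference' in reference:
        return [reference]  # already structured: no further processing
    extracted = []
    for raw_ref in reference.get('raw_refs', []):
        extracted.extend(fake_refextract(raw_ref['value']))
    return extracted


def extract_from_raw_refs(references):
    """Apply ``extract_from_raw_ref`` to each element and flatten the result."""
    result = []
    for reference in references:
        result.extend(extract_from_raw_ref(reference))
    return result


references = [
    {'reference': {'misc': ['already structured']}},
    {'raw_refs': [{'schema': 'text', 'value': 'Ref A; Ref B'}]},
]
result = extract_from_raw_refs(references)
print(len(result))
```

The second element expands into two references, which is why the single-element function returns a list rather than one dict.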
inspirehep.modules.workflows.tasks.submission module¶
Contains INSPIRE specific submission tasks.
-
inspirehep.modules.workflows.tasks.submission.
cleanup_pending_workflow
(*args, **kwargs)[source]¶ Cleans up the pending workflow entry for this workflow if any.
-
inspirehep.modules.workflows.tasks.submission.
close_ticket
(ticket_id_key='ticket_id')[source]¶ Close the ticket associated with this record found in given key.
-
inspirehep.modules.workflows.tasks.submission.
create_ticket
(template, context_factory=None, queue='Test', ticket_id_key='ticket_id')[source]¶ Create a ticket for the submission.
Creates the ticket in the given queue and stores the ticket ID in the extra_data key specified in ticket_id_key.
-
inspirehep.modules.workflows.tasks.submission.
filter_keywords
(*args, **kwargs)[source]¶ Removes non-accepted keywords from the metadata.
-
inspirehep.modules.workflows.tasks.submission.
prepare_keywords
(*args, **kwargs)[source]¶ Prepares the keywords in the correct format to be sent.
-
inspirehep.modules.workflows.tasks.submission.
reply_ticket
(template=None, context_factory=None, keep_new=False)[source]¶ Reply to a ticket for the submission.
-
inspirehep.modules.workflows.tasks.submission.
send_robotupload
(url=None, callback_url='callback/workflows/robotupload', mode='insert', extra_data_key=None)[source]¶ Get the MARCXML from the model and ship it.
If callback_url is set the workflow will halt and the callback is responsible for resuming it.
inspirehep.modules.workflows.tasks.upload module¶
Tasks related to record uploading.
-
inspirehep.modules.workflows.tasks.upload.
set_schema
(*args, **kwargs)[source]¶ Make sure schema is set properly and resolve it.
Module contents¶
Workflows tasks.