inspirehep.modules.workflows.tasks package

Submodules

inspirehep.modules.workflows.tasks.actions module

Tasks related to user actions.

inspirehep.modules.workflows.tasks.actions.add_core(*args, **kwargs)[source]

Mark a record as CORE if it was approved as CORE.

inspirehep.modules.workflows.tasks.actions.count_reference_coreness(*args, **kwargs)[source]

Count number of core/non-core matched references.

inspirehep.modules.workflows.tasks.actions.download_documents(*args, **kwargs)[source]

inspirehep.modules.workflows.tasks.actions.error_workflow(message)[source]

Force an error in the workflow with the given message.

inspirehep.modules.workflows.tasks.actions.fix_submission_number(*args, **kwargs)[source]

Ensure that the submission number contains the workflow object id.

Unlike form submissions, records coming from HEPCrawl can’t know yet which workflow object they will create, so they use the crawler job id as their submission number. We want the submission number to hold the id of the workflow object they came from instead, so that, given a record, we can link to its original Holding Pen entry.

Parameters:
  • obj – a workflow object.
  • eng – a workflow engine.
Returns:

None
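
The following is a minimal sketch of the behaviour described above, not the actual implementation. The FakeWorkflowObject class, the acquisition_source layout and the 'hepcrawl' method value are assumptions made for illustration.

```python
class FakeWorkflowObject(object):
    """Hypothetical stand-in for a workflow object (assumption)."""

    def __init__(self, id_, data):
        self.id = id_
        self.data = data


def fix_submission_number_sketch(obj, eng=None):
    """Replace the crawler job id with the workflow object id."""
    source = obj.data.get('acquisition_source', {})
    # Assumption: crawled records are tagged with method == 'hepcrawl'.
    if source.get('method') == 'hepcrawl':
        source['submission_number'] = str(obj.id)


crawled = FakeWorkflowObject(
    1234,
    {'acquisition_source': {'method': 'hepcrawl',
                            'submission_number': 'job-42'}},
)
submitted = FakeWorkflowObject(
    1235,
    {'acquisition_source': {'method': 'submitter',
                            'submission_number': '1235'}},
)
fix_submission_number_sketch(crawled)
fix_submission_number_sketch(submitted)
```

In this sketch only the crawled record is rewritten; the form submission keeps the number it already had.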

inspirehep.modules.workflows.tasks.actions.halt_record(action=None, message=None)[source]

Halt the workflow for approval with optional action.

inspirehep.modules.workflows.tasks.actions.in_production_mode(*args, **kwargs)[source]

Check if we are in production mode.

inspirehep.modules.workflows.tasks.actions.is_arxiv_paper(*args, **kwargs)[source]

Check if a workflow contains a paper from arXiv.

Parameters:
  • obj – a workflow object.
  • eng – a workflow engine.
Returns:

whether the workflow contains a paper from arXiv.

Return type:

bool

inspirehep.modules.workflows.tasks.actions.is_experimental_paper(*args, **kwargs)[source]

Check if a workflow contains an experimental paper.

Parameters:
  • obj – a workflow object.
  • eng – a workflow engine.
Returns:

whether the workflow contains an experimental paper.

Return type:

bool

inspirehep.modules.workflows.tasks.actions.is_marked(key)[source]

Check if the workflow object has a specific mark.

inspirehep.modules.workflows.tasks.actions.is_record_accepted(*args, **kwargs)[source]

Check if the record was approved.

inspirehep.modules.workflows.tasks.actions.is_record_relevant(*args, **kwargs)[source]

Decide whether to halt this workflow for potential acceptance or just reject it.

inspirehep.modules.workflows.tasks.actions.is_submission(*args, **kwargs)[source]

Check if a workflow contains a submission.

Parameters:
  • obj – a workflow object.
  • eng – a workflow engine.
Returns:

whether the workflow contains a submission.

Return type:

bool

inspirehep.modules.workflows.tasks.actions.jlab_ticket_needed(*args, **kwargs)[source]

Check if a JLab curation ticket is needed.

inspirehep.modules.workflows.tasks.actions.load_from_source_data(*args, **kwargs)[source]

Restore the workflow data and extra_data from source_data.

inspirehep.modules.workflows.tasks.actions.mark(key, value)[source]

Mark the workflow object by putting a value in a key in extra_data.

Note

Important. Committing a change to the database before saving the current workflow object will wipe away any content in extra_data not saved previously.

Parameters:
  • key – the key used to mark the workflow
  • value – the value assigned to the key
Returns:

the decorator to decorate a workflow object

Return type:

func
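
As an illustration of the decorator-factory pattern described above, here is a minimal sketch. The FakeWorkflowObject class is hypothetical, and the truthiness check in is_marked_sketch is an assumption, since the documentation does not say how a mark is tested.

```python
class FakeWorkflowObject(object):
    """Hypothetical stand-in for a workflow object (assumption)."""

    def __init__(self):
        self.extra_data = {}


def mark_sketch(key, value):
    """Return a task that stores ``value`` under ``key`` in extra_data."""
    def _mark(obj, eng):
        obj.extra_data[key] = value
    return _mark


def is_marked_sketch(key):
    """Return a task that reports whether ``key`` holds a truthy mark."""
    def _is_marked(obj, eng):
        # Assumption: a missing key and a falsy value both mean "not marked".
        return bool(obj.extra_data.get(key))
    return _is_marked


obj = FakeWorkflowObject()
mark_sketch('approved', True)(obj, None)
```

Because both helpers return a function of (obj, eng), they can be configured once when a workflow is defined and executed later by the engine.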

inspirehep.modules.workflows.tasks.actions.normalize_journal_titles(*args, **kwargs)[source]

Normalize the journal titles.

Normalize the journal titles stored in the journal_title field of each object contained in publication_info.

Note

The DB is queried in order to get the $ref of each journal and add it in journal_record.

Todo

Refactor: it must be checked that normalize_journal_title is appropriate.

Parameters:
  • obj – a workflow object.
  • eng – a workflow engine.
Returns:

None

inspirehep.modules.workflows.tasks.actions.populate_journal_coverage(*args, **kwargs)[source]

Populate journal_coverage from the Journals DB.

Checks in the Journals DB whether the current article was published in a journal that we harvest entirely, then populates the journal_coverage key in extra_data with 'full' if it was, 'partial' otherwise.

Parameters:
  • obj – a workflow object.
  • eng – a workflow engine.
Returns:

None
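
A minimal sketch of this lookup, under stated assumptions: the Journals DB is replaced by a hypothetical in-memory dictionary with illustrative entries, and records without any journal_title are left untouched, which the documentation does not specify.

```python
# Hypothetical stand-in for the Journals DB: title -> harvested entirely?
FULLY_HARVESTED = {'Phys.Rev.D': True, 'Nature': False}


def journal_coverage_sketch(data, extra_data, journals=FULLY_HARVESTED):
    """Set extra_data['journal_coverage'] to 'full' or 'partial'."""
    titles = [
        info['journal_title']
        for info in data.get('publication_info', [])
        if 'journal_title' in info
    ]
    if not titles:
        # Assumption: nothing to look up, so extra_data is left as it is.
        return
    is_full = any(journals.get(title, False) for title in titles)
    extra_data['journal_coverage'] = 'full' if is_full else 'partial'


full_extra, partial_extra, empty_extra = {}, {}, {}
journal_coverage_sketch(
    {'publication_info': [{'journal_title': 'Phys.Rev.D'}]}, full_extra)
journal_coverage_sketch(
    {'publication_info': [{'journal_title': 'Nature'}]}, partial_extra)
journal_coverage_sketch({}, empty_extra)
```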

inspirehep.modules.workflows.tasks.actions.populate_submission_document(*args, **kwargs)[source]

inspirehep.modules.workflows.tasks.actions.preserve_root(*args, **kwargs)[source]

Save the current workflow payload to be used as root for the merger.

Parameters:
  • obj – a workflow object.
  • eng – a workflow engine.
Returns:

None

inspirehep.modules.workflows.tasks.actions.refextract(*args, **kwargs)[source]

Extract references from various sources and add them to the workflow.

Runs refextract on both the PDF attached to the workflow and the references provided by the submitter, if any, then chooses whichever source produced the most references and attaches those to the workflow object.

Parameters:
  • obj – a workflow object.
  • eng – a workflow engine.
Returns:

None
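
The selection step can be sketched as follows. This only illustrates the "keep the larger list" rule; the tie-breaking behaviour is an assumption, since the documentation does not say which source wins when both yield the same number of references.

```python
def choose_references_sketch(from_pdf, from_submitter):
    """Return whichever extraction produced more references."""
    # Assumption: ties go to the PDF extraction.
    if len(from_submitter) > len(from_pdf):
        return from_submitter
    return from_pdf


pdf_refs = [{'raw_refs': [{'value': 'ref A'}]}]
text_refs = [
    {'raw_refs': [{'value': 'ref A'}]},
    {'raw_refs': [{'value': 'ref B'}]},
]
chosen = choose_references_sketch(pdf_refs, text_refs)
```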

inspirehep.modules.workflows.tasks.actions.reject_record(message)[source]

Reject record with message.

inspirehep.modules.workflows.tasks.actions.save_workflow(*args, **kwargs)[source]

Save the current workflow.

Saves the changes applied to the given workflow object in the database.

Note

The save function only indexes the current workflow. For this reason, we need to call db.session.commit().

Todo

Refactor: move this logic inside WorkflowObject.save().

Parameters:
  • obj – a workflow object.
  • eng – a workflow engine.
Returns:

None

inspirehep.modules.workflows.tasks.actions.set_refereed_and_fix_document_type(*args, **kwargs)[source]

Set the refereed field using the Journals DB.

Checks in the Journals DB whether the current article was published in journals that we know for sure to be peer-reviewed, or that publish both peer-reviewed and non-peer-reviewed content but for which we can infer that it belongs to the former category, and sets the refereed key in data to True if that was the case. If instead we know for sure that none of the journals in which it was published are peer-reviewed, we set it to False.

Also replaces the article document type with conference paper if the paper was only published in non-refereed proceedings.

Parameters:
  • obj – a workflow object.
  • eng – a workflow engine.
Returns:

None
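
The decision rules above can be sketched as below. This is an illustration only: the journals argument and its 'refereed' and 'proceedings' keys are hypothetical, and the inference for journals with mixed content is assumed to have already been folded into the 'refereed' flag (True, False, or None when unknown).

```python
def set_refereed_sketch(data, journals):
    """Apply the refereed and document type rules to ``data`` in place.

    ``journals`` is a list of dicts with the hypothetical keys
    'refereed' (True, False or None) and 'proceedings' (bool).
    """
    if not journals:
        return
    if any(journal['refereed'] is True for journal in journals):
        data['refereed'] = True
    elif all(journal['refereed'] is False for journal in journals):
        data['refereed'] = False
    # Otherwise the status is unknown and the key is left unset.

    only_unrefereed_proceedings = all(
        journal['proceedings'] and journal['refereed'] is False
        for journal in journals
    )
    if only_unrefereed_proceedings and 'document_type' in data:
        data['document_type'] = [
            'conference paper' if doc_type == 'article' else doc_type
            for doc_type in data['document_type']
        ]


peer_reviewed = {'document_type': ['article']}
set_refereed_sketch(
    peer_reviewed, [{'refereed': True, 'proceedings': False}])

proceedings_only = {'document_type': ['article']}
set_refereed_sketch(
    proceedings_only, [{'refereed': False, 'proceedings': True}])

unknown = {'document_type': ['article']}
set_refereed_sketch(unknown, [{'refereed': None, 'proceedings': False}])
```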

inspirehep.modules.workflows.tasks.actions.shall_halt_workflow(*args, **kwargs)[source]

Check if the workflow shall be halted.

inspirehep.modules.workflows.tasks.actions.validate_record(schema)[source]

inspirehep.modules.workflows.tasks.arxiv module

Tasks used in OAI harvesting for arXiv record manipulation.

inspirehep.modules.workflows.tasks.arxiv.arxiv_author_list(stylesheet='authorlist2marcxml.xsl')[source]

Extract authors from any author XML found in the arXiv archive.

Parameters:
  • obj – Workflow Object to process
  • eng – Workflow Engine processing the object

inspirehep.modules.workflows.tasks.arxiv.arxiv_derive_inspire_categories(*args, **kwargs)[source]

Derive inspire_categories from the arXiv categories.

Uses side effects to populate the inspire_categories key in obj.data by converting its arXiv categories.

Parameters:
  • obj (WorkflowObject) – a workflow object.
  • eng (WorkflowEngine) – a workflow engine.
Returns:

None

inspirehep.modules.workflows.tasks.arxiv.arxiv_package_download(*args, **kwargs)[source]

Perform the package download step for arXiv records.

Parameters:
  • obj – Workflow Object to process
  • eng – Workflow Engine processing the object

inspirehep.modules.workflows.tasks.arxiv.arxiv_plot_extract(*args, **kwargs)[source]

Extract plots from an arXiv archive.

Parameters:
  • obj – Workflow Object to process
  • eng – Workflow Engine processing the object

inspirehep.modules.workflows.tasks.arxiv.populate_arxiv_document(*args, **kwargs)[source]

inspirehep.modules.workflows.tasks.beard module

Set of workflow tasks for beard API.

inspirehep.modules.workflows.tasks.beard.get_beard_url()[source]

Return the BEARD URL endpoint, if any.

inspirehep.modules.workflows.tasks.beard.guess_coreness(*args, **kwargs)[source]

Workflow task to ask Beard API for a coreness assessment.

inspirehep.modules.workflows.tasks.beard.prepare_payload(record)[source]

Prepare payload to send to Beard API.

inspirehep.modules.workflows.tasks.classifier module

Set of tasks for classification.

inspirehep.modules.workflows.tasks.classifier.classify_paper(taxonomy=None, rebuild_cache=False, no_cache=False, output_limit=20, spires=False, match_mode='full', with_author_keywords=False, extract_acronyms=False, only_core_tags=False, fast_mode=False)[source]

Extract keywords from a PDF file or metadata in an OAI harvest.

inspirehep.modules.workflows.tasks.classifier.clean_instances_from_data(output)[source]

Check if specific keys are of InstanceType and replace them with their id.

inspirehep.modules.workflows.tasks.classifier.filter_core_keywords(*args, **kwargs)[source]

Filter core keywords.

inspirehep.modules.workflows.tasks.magpie module

Set of workflow tasks for MagPie API.

inspirehep.modules.workflows.tasks.magpie.filter_magpie_response(labels, limit)[source]

Filter response from Magpie API, keeping most relevant labels.
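
A minimal sketch of such a filter. The documentation does not say whether limit is a maximum count or a minimum score, so this example assumes the latter, and it also assumes that labels is a list of (label, score) pairs; both are assumptions.

```python
def filter_labels_sketch(labels, limit):
    """Keep labels scoring at least ``limit``, best first."""
    # Assumption: ``labels`` is a list of (label, score) pairs and
    # ``limit`` is a minimum score threshold.
    kept = [(label, score) for label, score in labels if score >= limit]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


response = [('Theory-HEP', 0.4), ('Experiment-HEP', 0.9), ('Other', 0.1)]
relevant = filter_labels_sketch(response, 0.25)
```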

inspirehep.modules.workflows.tasks.magpie.get_magpie_url()[source]

Return the Magpie URL endpoint, if any.

inspirehep.modules.workflows.tasks.magpie.guess_categories(*args, **kwargs)[source]

Workflow task to ask Magpie API for a subject area assessment.

inspirehep.modules.workflows.tasks.magpie.guess_experiments(*args, **kwargs)[source]

Workflow task to ask Magpie API for an experiments assessment.

inspirehep.modules.workflows.tasks.magpie.guess_keywords(*args, **kwargs)[source]

Workflow task to ask Magpie API for a keywords assessment.

inspirehep.modules.workflows.tasks.magpie.prepare_magpie_payload(record, corpus)[source]

Prepare payload to send to Magpie API.

inspirehep.modules.workflows.tasks.manual_merging module

Tasks related to manual merging.

inspirehep.modules.workflows.tasks.manual_merging.halt_for_merge_approval(*args, **kwargs)[source]

Wait for curator approval.

Pauses the workflow using the merge_approval action, which is resolved whenever the curator says that the conflicts have been solved.

Parameters:
  • obj – a workflow object.
  • eng – a workflow engine.
Returns:

None

inspirehep.modules.workflows.tasks.manual_merging.merge_records(*args, **kwargs)[source]

Perform a manual merge.

Merges two records stored in the workflow object as the content of the head and update keys, and stores the result in obj.data. Also stores any resulting conflicts in obj.extra_data['conflicts'].

Because this is a manual merge we assume that the two records have no common ancestor, so root is the empty dictionary.

Parameters:
  • obj – a workflow object.
  • eng – a workflow engine.
Returns:

None

inspirehep.modules.workflows.tasks.manual_merging.save_roots(*args, **kwargs)[source]

Save and update the head roots and delete the update roots from the db.

If both head and update have a root from a given source, then the older one is removed and the newer one is assigned to the head. Otherwise, the update roots from sources that are missing among the head roots are assigned to the head. In other words, it is a union-like operation.

Parameters:
  • obj – a workflow object.
  • eng – a workflow engine.
Returns:

None
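
The union-like rule can be sketched with plain dictionaries. The data layout here (a mapping from source name to an (updated, root) pair, with a comparable timestamp) is an assumption for illustration and does not reflect how roots are actually stored in the database.

```python
def merge_roots_sketch(head_roots, update_roots):
    """Return the roots the head should own after the merge.

    Both arguments map a source name to an ``(updated, root)`` pair,
    where ``updated`` is any comparable timestamp (assumption).
    """
    merged = dict(head_roots)
    for source, (updated, root) in update_roots.items():
        # Take the update's root when the head has none for this source,
        # or when the update's root is the newer of the two.
        if source not in merged or updated > merged[source][0]:
            merged[source] = (updated, root)
    return merged


head_roots = {'arxiv': (1, {'title': 'old'}), 'elsevier': (5, {'title': 'e'})}
update_roots = {'arxiv': (3, {'title': 'new'}), 'aps': (2, {'title': 'a'})}
merged_roots = merge_roots_sketch(head_roots, update_roots)
```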

inspirehep.modules.workflows.tasks.manual_merging.store_records(*args, **kwargs)[source]

Store the records involved in the manual merge.

Performs the following steps:

  1. Updates the head so that it contains the result of the merge.
  2. Marks the update as merged with the head and deletes it.
  3. Populates the deleted_records and new_record keys in, respectively, head and update so that they contain a JSON reference to each other.

Todo

The last step should be performed by the merge method itself.

Parameters:
  • obj – a workflow object.
  • eng – a workflow engine.
Returns:

None
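
The cross-referencing in the last step above can be sketched as follows. The URLs are placeholders, and setting a deleted flag on the update is an assumption about how "marks as merged and deletes" is represented; only the deleted_records and new_record keys come from the description.

```python
def link_merged_records_sketch(head, update, head_url, update_url):
    """Make head and update point at each other with JSON references."""
    head.setdefault('deleted_records', []).append({'$ref': update_url})
    update['new_record'] = {'$ref': head_url}
    # Assumption: the deletion is recorded as a boolean flag.
    update['deleted'] = True


# Placeholder URLs, used only for this example.
HEAD_URL = 'http://localhost:5000/api/literature/1'
UPDATE_URL = 'http://localhost:5000/api/literature/2'

head_record = {'control_number': 1}
update_record = {'control_number': 2}
link_merged_records_sketch(head_record, update_record, HEAD_URL, UPDATE_URL)
```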

inspirehep.modules.workflows.tasks.matching module

Tasks to check if the incoming record already exists.

inspirehep.modules.workflows.tasks.matching.auto_approve(obj, eng)[source]

Check if the currently ingested article should be auto-approved.

Parameters:
  • obj – a workflow object.
  • eng – a workflow engine.
Returns:

True when the record belongs to an arXiv category that is fully harvested or if the primary category is physics.data-an, otherwise False.

Return type:

bool
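
A minimal sketch of this rule, combining the two helper checks documented in this module. The set of fully harvested categories is illustrative only, and the arxiv_eprints layout (a list of entries with a categories list, primary category first) is an assumption about the record format.

```python
# Illustrative only: the real list of fully harvested categories is
# configured elsewhere.
FULLY_HARVESTED_CATEGORIES = {'hep-th', 'hep-ph'}


def _categories(record):
    """Return the arXiv categories of the first eprint, if any."""
    eprints = record.get('arxiv_eprints', [])
    return eprints[0].get('categories', []) if eprints else []


def has_fully_harvested_category_sketch(record):
    return any(
        category in FULLY_HARVESTED_CATEGORIES
        for category in _categories(record)
    )


def physics_data_an_is_primary_category_sketch(record):
    # Assumption: the primary category is listed first.
    categories = _categories(record)
    return bool(categories) and categories[0] == 'physics.data-an'


def auto_approve_sketch(record):
    return (
        has_fully_harvested_category_sketch(record)
        or physics_data_an_is_primary_category_sketch(record)
    )


core = {'arxiv_eprints': [{'categories': ['math.AG', 'hep-th']}]}
data_an = {'arxiv_eprints': [{'categories': ['physics.data-an', 'cs.LG']}]}
other = {'arxiv_eprints': [{'categories': ['cs.LG', 'physics.data-an']}]}
```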

inspirehep.modules.workflows.tasks.matching.delete_self_and_stop_processing(*args, **kwargs)[source]

Delete both versions of itself and stop the workflow.

inspirehep.modules.workflows.tasks.matching.exact_match(*args, **kwargs)[source]

Return True if the record is already present in the system.

Uses the default configuration of the inspire-matcher to find duplicates of the current workflow object in the system.

Also sets the matches.exact property in extra_data to the list of control numbers that matched.

Parameters:
  • obj – a workflow object.
  • eng – a workflow engine.
Returns:

True if the workflow object has a duplicate in the system, False otherwise.

Return type:

bool

inspirehep.modules.workflows.tasks.matching.fuzzy_match(*args, **kwargs)[source]

Return True if a similar record is found in the system.

Uses a custom configuration for inspire-matcher to find records similar to the current workflow object’s payload in the system.

Also sets the matches.fuzzy property in extra_data to a list with a brief summary of each of the first 5 records that matched.

Parameters:
  • obj – a workflow object.
  • eng – a workflow engine.
Returns:

True if a similar record is found in the system, False otherwise.

Return type:

bool

inspirehep.modules.workflows.tasks.matching.has_fully_harvested_category(record)[source]

Check if the record in obj.data has fully harvested categories.

Parameters:
  • record (dict) – the ingested article.
Returns:

True when the record belongs to an arXiv category that is fully harvested, otherwise False.

Return type:

bool

inspirehep.modules.workflows.tasks.matching.has_more_than_one_exact_match(*args, **kwargs)[source]

Check if the record has more than one exact match.

inspirehep.modules.workflows.tasks.matching.has_same_source(extra_data_key)[source]

Match a workflow in obj.extra_data[extra_data_key] by the source.

Takes a list of workflows from extra_data using extra_data_key as the key and goes through them, checking if at least one workflow has the same source as the current workflow object.

Parameters:
  • extra_data_key – the key used to retrieve a workflow list from the current workflow object.
Returns:

True if a workflow, whose id is in obj.extra_data[extra_data_key], matches the current workflow by the source.

Return type:

bool
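
A minimal sketch of this check. The FakeWorkflowObject class, the lookup dictionary that resolves workflow ids, and the acquisition_source.source path are all assumptions made for illustration.

```python
class FakeWorkflowObject(object):
    """Hypothetical stand-in for a workflow object (assumption)."""

    def __init__(self, id_, source, extra_data=None):
        self.id = id_
        self.data = {'acquisition_source': {'source': source}}
        self.extra_data = extra_data or {}


def _source(obj):
    return obj.data.get('acquisition_source', {}).get('source')


def has_same_source_sketch(extra_data_key, lookup):
    """Return a task comparing sources; ``lookup`` maps id -> workflow."""
    def _has_same_source(obj, eng=None):
        current = _source(obj)
        if current is None:
            # Assumption: a workflow without a source never matches.
            return False
        return any(
            _source(lookup[wf_id]) == current
            for wf_id in obj.extra_data.get(extra_data_key, [])
        )
    return _has_same_source


stored = {
    10: FakeWorkflowObject(10, 'arXiv'),
    11: FakeWorkflowObject(11, 'APS'),
}
incoming = FakeWorkflowObject(12, 'arXiv', {'holdingpen_matches': [10, 11]})
unrelated = FakeWorkflowObject(13, 'Elsevier', {'holdingpen_matches': [10, 11]})
check = has_same_source_sketch('holdingpen_matches', stored)
```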

inspirehep.modules.workflows.tasks.matching.is_fuzzy_match_approved(*args, **kwargs)[source]

Check if a fuzzy match has been approved by a human.

inspirehep.modules.workflows.tasks.matching.match_non_completed_wf_in_holdingpen(obj, eng)[source]

Return True if a matching wf is processing in the HoldingPen.

Uses a custom configuration of the inspire-matcher to find duplicates of the current workflow object in the Holding Pen not in the COMPLETED state.

Also sets holdingpen_matches in extra_data to the list of ids that matched.

Parameters:
  • obj – a workflow object.
  • eng – a workflow engine.
Returns:

True if the workflow object has a duplicate in the Holding Pen that is not COMPLETED, False otherwise.

Return type:

bool

inspirehep.modules.workflows.tasks.matching.match_previously_rejected_wf_in_holdingpen(obj, eng)[source]

Return True if matches a COMPLETED and rejected wf in the HoldingPen.

Uses a custom configuration of the inspire-matcher to find duplicates of the current workflow object in the Holding Pen in the COMPLETED state, marked as approved = False.

Also sets holdingpen_matches in extra_data to the list of ids that matched.

Parameters:
  • obj – a workflow object.
  • eng – a workflow engine.
Returns:

True if the workflow object has a duplicate in the Holding Pen that is COMPLETED and was rejected, False otherwise.

Return type:

bool

inspirehep.modules.workflows.tasks.matching.pending_in_holding_pen(*args, **kwargs)[source]

Return the list of matching workflows in the holdingpen.

Matches the holdingpen records by their arxiv_eprint, their doi, and by a custom validator function.

Parameters:
  • obj – a workflow object.
  • validation_func – a function used to filter the matched records.
Returns:

the ids matching the current obj that satisfy validation_func.

Return type:

(list)
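
The matching and filtering described above can be sketched with in-memory data. The holding_pen dictionary (workflow id to record) and the arxiv_eprints and dois layouts are assumptions; the real task queries the Holding Pen instead.

```python
def _identifiers(record):
    """Return the sets of arXiv eprint values and DOI values of a record."""
    eprints = {eprint['value'] for eprint in record.get('arxiv_eprints', [])}
    dois = {doi['value'] for doi in record.get('dois', [])}
    return eprints, dois


def pending_in_holding_pen_sketch(record, holding_pen, validation_func):
    """Return ids of stored records sharing an eprint or DOI with ``record``.

    ``holding_pen`` maps a workflow id to its record (assumption), and
    ``validation_func`` decides which matches are kept.
    """
    eprints, dois = _identifiers(record)
    matched = []
    for wf_id, other in sorted(holding_pen.items()):
        other_eprints, other_dois = _identifiers(other)
        shares_identifier = bool(
            eprints & other_eprints or dois & other_dois)
        if shares_identifier and validation_func(other):
            matched.append(wf_id)
    return matched


holding_pen = {
    1: {'arxiv_eprints': [{'value': '1701.00001'}], 'approved': False},
    2: {'dois': [{'value': '10.1000/xyz'}]},
    3: {'arxiv_eprints': [{'value': '1701.00001'}], 'approved': True},
}
incoming = {'arxiv_eprints': [{'value': '1701.00001'}]}
not_approved = pending_in_holding_pen_sketch(
    incoming, holding_pen, lambda other: not other.get('approved', False))
everything = pending_in_holding_pen_sketch(
    incoming, holding_pen, lambda other: True)
```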

inspirehep.modules.workflows.tasks.matching.physics_data_an_is_primary_category(record)[source]

inspirehep.modules.workflows.tasks.matching.raise_if_match_wf_in_error_or_initial(obj, eng)[source]

Raise if a matching wf is in ERROR or INITIAL state in the HoldingPen.

Uses a custom configuration of the inspire-matcher to find duplicates of the current workflow object in the Holding Pen that are in the ERROR or INITIAL state.

If any match is found, it sets error_workflows_matched in extra_data to the list of ids that matched and raises an error.

Parameters:
  • obj – a workflow object.
  • eng – a workflow engine.
Returns:

None

inspirehep.modules.workflows.tasks.matching.set_core_in_extra_data(*args, **kwargs)[source]

Set core=True in obj.extra_data if the record belongs to a core arXiv category.

inspirehep.modules.workflows.tasks.matching.set_exact_match_as_approved_in_extradata(*args, **kwargs)[source]

Set the best match in matches.approved in extra_data.

inspirehep.modules.workflows.tasks.matching.set_fuzzy_match_approved_in_extradata(*args, **kwargs)[source]

Set the human approved match in matches.approved in extra_data.

inspirehep.modules.workflows.tasks.matching.stop_matched_holdingpen_wfs(obj, eng)[source]

Stop the matched workflow objects in the holdingpen.

Stops the matched workflows in the holdingpen by replacing their steps with a new one defined on the fly, containing a stop step, and executing it. For traceability, these workflows are also marked with the key 'stopped-by-wf', whose value is the current workflow’s id.

When an article is harvested twice, this function stops the first workflow and lets the current one be processed, since it has the latest metadata.

Parameters:
  • obj – a workflow object.
  • eng – a workflow engine.
Returns:

None

inspirehep.modules.workflows.tasks.matching.stop_processing(*args, **kwargs)[source]

Stop processing the given workflow.

Stops the given workflow engine. This stops all the workflows related to it.

Parameters:
  • obj – a workflow object.
  • eng – a workflow engine.
Returns:

None

inspirehep.modules.workflows.tasks.merging module

Tasks related to record merging.

inspirehep.modules.workflows.tasks.merging.has_conflicts(*args, **kwargs)[source]

Return whether the workflow has any conflicts.

inspirehep.modules.workflows.tasks.merging.merge_articles(*args, **kwargs)[source]

Merge two articles.

The workflow payload is overwritten by the merged record, the conflicts are stored in extra_data.conflicts. Also, it adds a callback_url which contains the endpoint which resolves the merge conflicts.

Note

When the feature flag FEATURE_FLAG_ENABLE_MERGER is False it will skip the merge.

inspirehep.modules.workflows.tasks.refextract module

Workflow tasks using refextract API.

inspirehep.modules.workflows.tasks.refextract.extract_journal_info(*args, **kwargs)[source]

Extract the journal information from pubinfo_freetext.

Runs extract_journal_reference on the pubinfo_freetext key of each publication_info, if it exists, and uses the extracted information to populate the other keys.

Parameters:
  • obj – a workflow object.
  • eng – a workflow engine.
Returns:

None

inspirehep.modules.workflows.tasks.refextract.extract_references_from_pdf(*args, **kwargs)[source]

Extract references from PDF and return in INSPIRE format.

inspirehep.modules.workflows.tasks.refextract.extract_references_from_raw_ref(reference, custom_kbs_file=None)[source]

Extract references from raw references in reference element.

Parameters:
  • reference (dict) – a schema-compliant element of the references field. If it already contains a structured reference (that is, a reference key), no further processing is done. Otherwise, the contents of raw_refs are extracted by refextract.
  • custom_kbs_file (dict) – configuration for refextract knowledge bases.
Returns:

a list of schema-compliant elements of the references field, with all previously unextracted references extracted.

Return type:

List[dict]

Note

This function returns a list of references because one raw reference might correspond to several references.
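
The control flow described above can be sketched as below. The fake_refextract parser is a hypothetical stand-in that simply splits on semicolons (it is not how refextract works), and the shape of the raw_refs entries is an assumption.

```python
def fake_refextract(text):
    """Hypothetical parser: one structured reference per ';' chunk."""
    return [
        {'reference': {'misc': [part.strip()]}}
        for part in text.split(';')
        if part.strip()
    ]


def extract_from_raw_ref_sketch(reference, parse=fake_refextract):
    """Return a list of structured references for one element."""
    if 'reference' in reference:
        # Already structured: no further processing is done.
        return [reference]
    extracted = []
    for raw_ref in reference.get('raw_refs', []):
        for parsed in parse(raw_ref['value']):
            # Keep the raw text next to what was extracted from it.
            extracted.append(dict(parsed, raw_refs=[raw_ref]))
    # Assumption: if nothing could be extracted, keep the element as is.
    return extracted or [reference]


structured = {'reference': {'misc': ['already parsed']}}
raw = {'raw_refs': [{
    'schema': 'text',
    'value': 'A. Author, J. Phys. 1 (2000) 1; B. Author, J. Phys. 2 (2001) 2',
}]}
from_structured = extract_from_raw_ref_sketch(structured)
from_raw = extract_from_raw_ref_sketch(raw)
```

Here the single raw reference yields two structured ones, which is why the return value is a list.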

inspirehep.modules.workflows.tasks.refextract.extract_references_from_raw_refs(*args, **kwargs)[source]

Extract references from raw references in reference list.

Parameters:
  • references (List[dict]) – a schema-compliant references field. If an element already contains a structured reference (that is, a reference key), it is not modified. Otherwise, the contents of raw_refs are extracted by refextract.
  • custom_kbs_file (dict) – configuration for refextract knowledge bases.
Returns:

a schema-compliant references field, with all previously unextracted references extracted.

Return type:

List[dict]

inspirehep.modules.workflows.tasks.refextract.extract_references_from_text(*args, **kwargs)[source]

Extract references from text and return in INSPIRE format.

inspirehep.modules.workflows.tasks.submission module

Contains INSPIRE specific submission tasks.

inspirehep.modules.workflows.tasks.submission.cleanup_pending_workflow(*args, **kwargs)[source]

Cleans up the pending workflow entry for this workflow if any.

inspirehep.modules.workflows.tasks.submission.close_ticket(ticket_id_key='ticket_id')[source]

Close the ticket associated with this record found in given key.

inspirehep.modules.workflows.tasks.submission.create_ticket(template, context_factory=None, queue='Test', ticket_id_key='ticket_id')[source]

Create a ticket for the submission.

Creates the ticket in the given queue and stores the ticket ID in the extra_data key specified in ticket_id_key.

inspirehep.modules.workflows.tasks.submission.filter_keywords(*args, **kwargs)[source]

Removes non-accepted keywords from the metadata.

inspirehep.modules.workflows.tasks.submission.prepare_keywords(*args, **kwargs)[source]

Prepares the keywords in the correct format to be sent.

inspirehep.modules.workflows.tasks.submission.reply_ticket(template=None, context_factory=None, keep_new=False)[source]

Reply to a ticket for the submission.

inspirehep.modules.workflows.tasks.submission.send_robotupload(url=None, callback_url='callback/workflows/robotupload', mode='insert', extra_data_key=None)[source]

Get the MARCXML from the model and ship it.

If callback_url is set the workflow will halt and the callback is responsible for resuming it.

inspirehep.modules.workflows.tasks.submission.send_to_legacy(obj, eng)[source]

inspirehep.modules.workflows.tasks.submission.submit_rt_ticket(*args, **kwargs)[source]

Submit ticket to RT with the given parameters.

inspirehep.modules.workflows.tasks.submission.wait_webcoll(*args, **kwargs)[source]

inspirehep.modules.workflows.tasks.upload module

Tasks related to record uploading.

inspirehep.modules.workflows.tasks.upload.set_schema(*args, **kwargs)[source]

Make sure schema is set properly and resolve it.

inspirehep.modules.workflows.tasks.upload.store_record(*args, **kwargs)[source]

Insert or replace a record.

inspirehep.modules.workflows.tasks.upload.store_root(*args, **kwargs)[source]

Insert or update the current record head’s root into the WorkflowsRecordSources table.

Module contents

Workflows tasks.