Ingestion of records (Workflows)

Inspire-next retrieves new records every day from several sources, such as:
  • External sites (arXiv, Proceedings of Science, ...).
  • Users, through submission forms.

Records from external sites are harvested by hepcrawl, which is run periodically by a celery beat task.
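For illustration, a daily harvest could be registered with celery beat roughly as follows. This is only a minimal sketch: the entry name, task path and keyword arguments are hypothetical placeholders, not the names Inspire-next actually uses, and the name of the setting itself depends on the celery version and configuration namespace in use.

```python
from datetime import timedelta

# Hypothetical beat schedule; the entry name, task path and kwargs are
# placeholders and do not reflect the real Inspire-next configuration.
CELERY_BEAT_SCHEDULE = {
    "harvest-arxiv-daily": {
        "task": "example.tasks.schedule_crawl",  # hypothetical task path
        "schedule": timedelta(days=1),           # run once every 24 hours
        "kwargs": {"spider": "arXiv"},           # hypothetical arguments
    },
}
```

Celery beat accepts a `timedelta` (or a crontab expression) as the `schedule` value, so changing the harvesting frequency would only require editing that one field.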

Users can also suggest new literature and author records through the submission forms.

High-quality information is one of the main goals of Inspire, so every record is carefully reviewed by our team of curators before it is accepted into the Inspire database.

The diagram below summarizes the process.

[Figure: workflows_overview.png]

Handle workflows in error state

Via web interface

  1. Visit the Holding Pen list and filter for records in error state.

  2. If there are any, check the error report on the record's detail page to find out why the workflow failed.

  3. Sometimes the failure is circumstantial and the fix is simply to restart the task.

    You can do that from the interface by clicking the “current task” button and then hitting restart.

Via shell

  1. SSH into any worker machine (usually builder, to avoid affecting the machines serving users).
  2. Enter the shell and retrieve all records in error state:
inspirehep shell
from invenio_workflows import workflow_object_class, ObjectStatus
errors = workflow_object_class.query(status=ObjectStatus.ERROR)
  3. Get a specific object:
from invenio_workflows import workflow_object_class
obj = workflow_object_class.get(1234)
obj.data  # Check data
obj.extra_data  # Check extra data
obj.status  # Check status
obj.callback_pos  # Position in current workflow
  4. See associated workflow definition:
from invenio_workflows import workflows
workflows[obj.workflow.name].workflow   # Associated workflow list of tasks
  5. Manipulate position in the workflow:
obj.callback_pos = [1, 2, 3]
obj.save()
# Commit the session to persist the change in the db
from invenio_db import db
db.session.commit()
  6. Restart workflow in various positions:
obj.restart_current()  # Restart from current task and continue workflow
obj.restart_next()  # Skip current task and continue workflow
obj.restart_previous()  # Redo task before current one and continue workflow

# If the workflow is in its initial state, you can start it from scratch
from invenio_workflows import start
start('article', object_id=obj.id)
# or for an author workflow
start('author', object_id=obj.id)
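
When many records are in error state, it can help to group them by error message before deciding which ones to restart. The helpers below are a minimal sketch of that triage, not part of Inspire-next or invenio_workflows: the function names are hypothetical, and the `_error_msg` key is an assumption about where the error message is stored in `extra_data`, so pass a different `error_key` if your records differ. They rely only on the `id`, `extra_data` and `restart_current()` attributes used in the steps above.

```python
from collections import defaultdict


def first_line(text):
    """Return the first line of a message, or a placeholder if it is empty."""
    lines = str(text).strip().splitlines()
    return lines[0] if lines else "<no message>"


def group_by_error(objects, error_key="_error_msg"):
    """Map the first line of each error message to the ids that share it.

    `error_key` is an assumed location of the message inside `extra_data`.
    """
    groups = defaultdict(list)
    for obj in objects:
        message = (obj.extra_data or {}).get(error_key) or ""
        groups[first_line(message)].append(obj.id)
    return dict(groups)


def restart_all(objects, dry_run=True):
    """Restart each object from its current task.

    With dry_run=True (the default) nothing is restarted; the ids that
    would be restarted are returned so they can be reviewed first.
    """
    selected = []
    for obj in objects:
        if not dry_run:
            obj.restart_current()
        selected.append(obj.id)
    return selected
```

In the shell session above, `group_by_error(errors)` would give an overview of the failures, and `restart_all(subset, dry_run=False)` would restart a chosen subset. Keeping the dry run as the default makes it harder to restart a large batch by accident.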