inspirehep.modules.refextract package

Submodules

inspirehep.modules.refextract.config module

Refextract config.

inspirehep.modules.refextract.config.REFERENCE_MATCHER_DATA_CONFIG = {'doc_type': 'data', 'source': ['control_number'], 'algorithm': [{'queries': [{'path': 'reference.dois', 'type': 'exact', 'search_path': 'dois.value.raw'}]}], 'index': 'records-data'}

Configuration for matching data records. Please note that the index and doc_type are different for data records.
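As an illustration of how an 'exact' query in this configuration can be read, the sketch below pairs each value found under a query's path with the query's search_path. The helpers get_values and describe_exact_queries are hypothetical and not part of inspirehep or inspire-matcher. The input shape (a dict with a top-level 'reference' key) is an assumption inferred from the 'reference.dois' path, and the DOI is a placeholder.

```python
# Copied from the configuration documented above.
REFERENCE_MATCHER_DATA_CONFIG = {
    'doc_type': 'data',
    'source': ['control_number'],
    'algorithm': [
        {'queries': [
            {'path': 'reference.dois',
             'type': 'exact',
             'search_path': 'dois.value.raw'},
        ]},
    ],
    'index': 'records-data',
}


def get_values(record, dotted_path):
    """Hypothetical helper: follow a dotted path, flattening any lists."""
    current = [record]
    for key in dotted_path.split('.'):
        next_level = []
        for item in current:
            if isinstance(item, dict) and key in item:
                value = item[key]
                if isinstance(value, list):
                    next_level.extend(value)
                else:
                    next_level.append(value)
        current = next_level
    return current


def describe_exact_queries(record, config):
    """Hypothetical helper: list (search_path, value) pairs for exact queries."""
    described = []
    for step in config['algorithm']:
        for query in step['queries']:
            if query['type'] != 'exact':
                continue
            for value in get_values(record, query['path']):
                described.append((query['search_path'], value))
    return described


# Assumed input shape; the DOI is a placeholder, not a real record.
example = {'reference': {'dois': ['10.1000/example.doi']}}
print(describe_exact_queries(example, REFERENCE_MATCHER_DATA_CONFIG))
```

This only describes what would be searched. The actual lookup against the 'records-data' index is performed by inspire-matcher and is not reproduced here.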

inspirehep.modules.refextract.config.REFERENCE_MATCHER_DEFAULT_PUBLICATION_INFO_CONFIG = {'doc_type': 'hep', 'collections': ['Literature'], 'source': ['control_number'], 'algorithm': [{'queries': [{'paths': ['reference.publication_info.journal_issue', 'reference.publication_info.journal_title', 'reference.publication_info.journal_volume', 'reference.publication_info.artid'], 'type': 'nested', 'search_paths': ['publication_info.journal_issue', 'publication_info.journal_title.raw', 'publication_info.journal_volume', 'publication_info.page_artid']}, {'paths': ['reference.publication_info.journal_issue', 'reference.publication_info.journal_title', 'reference.publication_info.journal_volume', 'reference.publication_info.page_start'], 'type': 'nested', 'search_paths': ['publication_info.journal_issue', 'publication_info.journal_title.raw', 'publication_info.journal_volume', 'publication_info.page_artid']}, {'paths': ['reference.publication_info.journal_title', 'reference.publication_info.journal_volume', 'reference.publication_info.artid'], 'type': 'nested', 'search_paths': ['publication_info.journal_title.raw', 'publication_info.journal_volume', 'publication_info.page_artid']}, {'paths': ['reference.publication_info.journal_title', 'reference.publication_info.journal_volume', 'reference.publication_info.page_start'], 'type': 'nested', 'search_paths': ['publication_info.journal_title.raw', 'publication_info.journal_volume', 'publication_info.page_artid']}]}], 'index': 'records-hep'}

Configuration for matching all HEP records using publication_info. These are separate from the unique queries since these can result in multiple matches (particularly in the case of errata).
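In the 'nested' queries, the entries of paths and search_paths appear to line up by position, so that, for example, reference.publication_info.page_start is compared against publication_info.page_artid. The sketch below makes that pairing explicit for one query from the configuration. The positional reading is an assumption drawn from the configuration itself, the helpers lookup and nested_terms are hypothetical, and the reference values are placeholders.

```python
# One of the nested queries from the configuration documented above.
NESTED_QUERY = {
    'paths': [
        'reference.publication_info.journal_title',
        'reference.publication_info.journal_volume',
        'reference.publication_info.page_start',
    ],
    'type': 'nested',
    'search_paths': [
        'publication_info.journal_title.raw',
        'publication_info.journal_volume',
        'publication_info.page_artid',
    ],
}


def lookup(record, dotted_path):
    """Hypothetical helper: return the value at a dotted path, or None."""
    current = record
    for key in dotted_path.split('.'):
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


def nested_terms(record, query):
    """Hypothetical helper: map each search path to the reference's value.

    Assumes paths and search_paths correspond by position.
    """
    terms = {}
    for path, search_path in zip(query['paths'], query['search_paths']):
        terms[search_path] = lookup(record, path)
    return terms


# Placeholder values, not a real citation.
example = {
    'reference': {
        'publication_info': {
            'journal_title': 'Example J.',
            'journal_volume': '12',
            'page_start': '345',
        },
    },
}
print(nested_terms(example, NESTED_QUERY))
```

How inspire-matcher treats a query when one of the values is missing is not covered by this sketch.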

inspirehep.modules.refextract.config.REFERENCE_MATCHER_JHEP_AND_JCAP_PUBLICATION_INFO_CONFIG = {'doc_type': 'hep', 'collections': ['Literature'], 'source': ['control_number'], 'algorithm': [{'queries': [{'paths': ['reference.publication_info.journal_title', 'reference.publication_info.journal_volume', 'reference.publication_info.year', 'reference.publication_info.artid'], 'type': 'nested', 'search_paths': ['publication_info.journal_title.raw', 'publication_info.journal_volume', 'publication_info.year', 'publication_info.page_artid']}, {'paths': ['reference.publication_info.journal_title', 'reference.publication_info.journal_volume', 'reference.publication_info.year', 'reference.publication_info.page_start'], 'type': 'nested', 'search_paths': ['publication_info.journal_title.raw', 'publication_info.journal_volume', 'publication_info.year', 'publication_info.page_artid']}]}], 'index': 'records-hep'}

Configuration for matching JCAP and JHEP records using publication_info. The year is also compared, since it is needed for accurate matching of these journals. These are separate from the unique queries since these can result in multiple matches (particularly in the case of errata).

inspirehep.modules.refextract.config.REFERENCE_MATCHER_UNIQUE_IDENTIFIERS_CONFIG = {'doc_type': 'hep', 'collections': ['Literature'], 'source': ['control_number'], 'algorithm': [{'queries': [{'path': 'reference.arxiv_eprint', 'type': 'exact', 'search_path': 'arxiv_eprints.value.raw'}, {'path': 'reference.dois', 'type': 'exact', 'search_path': 'dois.value.raw'}, {'path': 'reference.isbn', 'type': 'exact', 'search_path': 'isbns.value.raw'}, {'path': 'reference.texkey', 'type': 'exact', 'search_path': 'texkeys.raw'}, {'path': 'reference.report_numbers', 'type': 'exact', 'search_path': 'report_numbers.value.fuzzy'}]}], 'index': 'records-hep'}

Configuration for matching all HEP records (including JHEP and JCAP records) using unique identifiers.

inspirehep.modules.refextract.matcher module

inspirehep.modules.refextract.matcher.match_reference(reference, previous_matched_recid=None)

Match a reference using inspire-matcher.

Parameters:
  • reference (dict) – the metadata of a reference.
  • previous_matched_recid (int) – the record id of the last matched reference from the list of references.
Returns: the matched reference.
Return type: dict

inspirehep.modules.refextract.matcher.match_reference_with_config(reference, config, previous_matched_recid=None)

Match a reference using inspire-matcher given the config.

Parameters:
  • reference (dict) – the metadata of the reference.
  • config (dict) – the list of inspire-matcher configurations for queries.
  • previous_matched_recid (int) – the record id of the last matched reference from the list of references.
Returns: the matched reference.
Return type: dict

inspirehep.modules.refextract.matcher.match_references(references)

Match references to their respective records in INSPIRE.

Parameters: references (list) – the list of references.
Returns: the matched references.
Return type: list

inspirehep.modules.refextract.tasks module

Refextract tasks.

(task) inspirehep.modules.refextract.tasks.create_journal_kb_file

Populate refextract's journal KB from the database.

Uses two raw DB queries with PostgreSQL-specific syntax to generate a file in the format that refextract expects, that is, a list of lines like:

SOURCE---DESTINATION

where each line means that SOURCE is translated to DESTINATION when found.

Note that refextract expects SOURCE to be normalized, which means removing all non-alphanumeric characters, collapsing all contiguous whitespace to one space, and uppercasing the resulting string.
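The normalization described in the note can be sketched as follows. This is one reading of the note, not the code used by the task: the function names normalize_source and format_kb_line are hypothetical, only ASCII letters and digits are treated as alphanumeric, and each removed character is replaced by a space (an assumption, so that adjacent words do not fuse together).

```python
import re

# Assumption: anything that is not an ASCII letter, a digit, or whitespace
# counts as "non-alphanumeric" and is replaced by a single space.
_NON_ALNUM = re.compile(r"[^0-9A-Za-z\s]")


def normalize_source(raw):
    """Hypothetical helper: normalize a SOURCE value for the journal KB."""
    cleaned = _NON_ALNUM.sub(" ", raw)
    # split() with no arguments drops leading/trailing whitespace and
    # collapses every run of whitespace, so joining yields single spaces.
    collapsed = " ".join(cleaned.split())
    return collapsed.upper()


def format_kb_line(source, destination):
    """Hypothetical helper: build one SOURCE---DESTINATION line."""
    return "{}---{}".format(normalize_source(source), destination)


print(format_kb_line("Phys. Rev.  D", "Phys.Rev.D"))
```

Only SOURCE is normalized in this sketch. DESTINATION is written as given, since the note states the requirement for SOURCE alone.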

inspirehep.modules.refextract.utils module

Refextract utils.

class inspirehep.modules.refextract.utils.KbWriter(kb_path)

Bases: object

add_entry(value, kb_key)

Module contents

RefExtract integration.