inspirehep.modules.disambiguation.core.ml package

Submodules

inspirehep.modules.disambiguation.core.ml.models module

Disambiguation core ML models.

class inspirehep.modules.disambiguation.core.ml.models.DistanceEstimator(ethnicity_estimator)[source]

Bases: object

fit()[source]
load_data(signatures_path, pairs_path, pairs_size, publications_path)[source]
load_model(input_filename)[source]
save_model(output_filename)[source]
class inspirehep.modules.disambiguation.core.ml.models.EthnicityEstimator(C=4.0)[source]

Bases: object

fit()[source]
load_data(input_filename)[source]
load_model(input_filename)[source]
predict(X)[source]
save_model(output_filename)[source]
inspirehep.modules.disambiguation.core.ml.models.get_abstract(signature)[source]
inspirehep.modules.disambiguation.core.ml.models.get_author_affiliation(signature)[source]
inspirehep.modules.disambiguation.core.ml.models.get_author_full_name(signature)[source]
inspirehep.modules.disambiguation.core.ml.models.get_author_other_names(signature)[source]
inspirehep.modules.disambiguation.core.ml.models.get_coauthors_neighborhood(signature, radius=10)[source]
inspirehep.modules.disambiguation.core.ml.models.get_collaborations(signature)[source]
inspirehep.modules.disambiguation.core.ml.models.get_first_given_name(signature)[source]
inspirehep.modules.disambiguation.core.ml.models.get_first_initial(signature)[source]
inspirehep.modules.disambiguation.core.ml.models.get_keywords(signature)[source]
inspirehep.modules.disambiguation.core.ml.models.get_second_given_name(signature)[source]
inspirehep.modules.disambiguation.core.ml.models.get_second_initial(signature)[source]
inspirehep.modules.disambiguation.core.ml.models.get_title(signature)[source]
inspirehep.modules.disambiguation.core.ml.models.get_topics(signature)[source]
inspirehep.modules.disambiguation.core.ml.models.group_by_signature(signatures)[source]

inspirehep.modules.disambiguation.core.ml.sampling module

Disambiguation core ML sampling.

inspirehep.modules.disambiguation.core.ml.sampling.sample_signature_pairs(signatures_path, clusters_path, pairs_size)[source]

Sample signature pairs to reduce the amount of training data.

Since INSPIRE contains ~3M curated signatures, training on all possible pairs would take too long, so we sample a subset that is representative of the known cluster structure.

This is accomplished in three steps:

  1. First we read all the clusters and signatures and build in-memory data structures that support fast lookups of the id of the cluster to which a signature belongs, as well as of the name of the author associated with the signature.

    At the same time we partition the signatures into blocks according to the phonetic encoding of the author name. Note that two signatures pointing to two distinct authors might end up in the same block.

  2. Then we classify signature pairs that belong to the same block according to whether they belong to the same cluster and whether they share the same author name.

    The former is because we want examples of pairs of signatures in the same block that point to the same author as well as pairs that point to different authors, while the latter is to avoid oversampling the typical case of signatures with exactly the same author name.

  3. Finally we sample an equal portion of the desired number of pairs from each of the resulting non-empty categories. Note that this requires the desired number of pairs to be divisible by 12, the LCM of the possible numbers of non-empty categories (1, 2, 3 or 4), so that we sample the same number of pairs from each category.
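The three steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the module's actual implementation: the field names (`signature_uuid`, `author_name`, `cluster_id`, `signature_uuids`), the helper `toy_block_key` (a crude stand-in for a real phonetic encoding), the in-memory inputs instead of file paths, and sampling with replacement are all hypothetical choices made for the example. It also assumes every signature belongs to exactly one cluster.

```python
import itertools
import random
from collections import defaultdict


def toy_block_key(author_name):
    """Hypothetical stand-in for a phonetic encoding:
    lowercased surname with vowels dropped after the first letter."""
    surname = author_name.split(",")[0].strip().lower()
    return surname[:1] + "".join(c for c in surname[1:] if c not in "aeiou")


def sample_pairs(signatures, clusters, pairs_size, seed=0):
    """Return `pairs_size` (uuid, uuid) pairs, split evenly across categories."""
    if pairs_size % 12:
        raise ValueError("pairs_size must be divisible by 12")

    # Step 1: lookup tables (signature -> cluster id, signature -> name)
    # and blocks keyed by the (toy) phonetic encoding of the name.
    cluster_of = {
        uuid: cluster["cluster_id"]
        for cluster in clusters
        for uuid in cluster["signature_uuids"]
    }
    name_of = {s["signature_uuid"]: s["author_name"] for s in signatures}
    blocks = defaultdict(list)
    for uuid, name in name_of.items():
        blocks[toy_block_key(name)].append(uuid)

    # Step 2: classify every within-block pair by
    # (same cluster?, same author name?), giving up to four categories.
    categories = defaultdict(list)
    for uuids in blocks.values():
        for a, b in itertools.combinations(uuids, 2):
            same_cluster = cluster_of[a] == cluster_of[b]
            same_name = name_of[a] == name_of[b]
            categories[(same_cluster, same_name)].append((a, b))
    if not categories:
        return []

    # Step 3: draw an equal share from each non-empty category.
    # Assumption: sampling with replacement, so small categories can
    # still fill their share.
    rng = random.Random(seed)
    share = pairs_size // len(categories)
    sampled = []
    for key in sorted(categories):
        sampled.extend(rng.choices(categories[key], k=share))
    return sampled
```

Because the number of non-empty categories can be 1, 2, 3 or 4, a `pairs_size` divisible by 12 always splits into equal integer shares, which is why the sketch rejects other values up front.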

Yields: dict – a signature pair.

Module contents

Disambiguation core ML.