ZincBase Documentation¶

ZincBase is a state of the art knowledge base. It does the following:
- Extract facts (aka triples and rules) from unstructured data/text
- Store and retrieve those facts efficiently
- Build them into a graph
- Provide ways to query the graph, including via bleeding-edge graph neural networks.
ZincBase exists to answer questions like “what is the probability that Tom likes LARPing?”, “who likes LARPing?”, or “classify people into LARPers vs normies”.

It combines the latest in neural networks with symbolic logic (think expert systems and prolog) and graph search.
View full documentation here.
Quickstart¶
from zincbase import KB
kb = KB()
kb.store('eats(tom, rice)')
for ans in kb.query('eats(tom, Food)'):
    print(ans['Food']) # prints 'rice'
...
# The included assets/countries_s1_train.csv contains triples like:
# (namibia, locatedin, africa)
# (lithuania, neighbor, poland)
kb = KB()
kb.from_csv('./assets/countries.csv')
kb.build_kg_model(cuda=False, embedding_size=40)
kb.train_kg_model(steps=2000, batch_size=1, verbose=False)
kb.estimate_triple_prob('fiji', 'locatedin', 'melanesia')
0.8467
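Once the model is trained, the same KB object can answer the kinds of questions mentioned in the introduction. A minimal sketch (these calls are documented in the API reference below; exact numbers will vary between training runs):
# Probability that a fact is true
kb.estimate_triple_prob('fiji', 'locatedin', 'melanesia')
# Top-k completions of a partial triple; '?' marks the unknown slot
kb.get_most_likely('fiji', 'locatedin', '?', k=2)
# Symbolic (Prolog-style) query over the stored facts
list(kb.query('locatedin(fiji, Where)'))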
Requirements¶
- Python 3
- Libraries from requirements.txt
- GPU preferable for large graphs but not required
Installation¶
pip install -r requirements.txt
Note: Requirements might differ for PyTorch depending on your system.
Testing¶
python test/test_main.py
python test/test_graph.py
python test/test_lists.py
python test/test_nn_basic.py
python test/test_nn.py
python test/test_neg_examples.py
python test/test_truthiness.py
python -m doctest zincbase/zincbase.py
Validation¶
“Countries” and “FB15k” datasets are included in this repo.
There is a script to verify that ZincBase achieves at least the performance on the Countries dataset reported in the original (2019) RotatE paper. From the repo’s root directory:
python examples/eval_countries_s3.py
It tests the hardest Countries task and prints out the AUC ROC, which should be ~ 0.95 to match the paper. It takes about 30 minutes to run on a modern GPU.
There is also a script to evaluate performance on FB15k: python examples/fb15k_mrr.py
Building documentation¶
From the docs/ dir: make html. If the module structure has changed a lot, regenerate the stubs first: sphinx-apidoc -o . ..
TODO¶
- Refactor so node and edge are their own class
- Query all edges by attribute
- Rules (observables) to say ‘on change of attribute, run this small program and propagate changes’
- Will enable advanced simulation beginning with Abelian sandpile
- to_csv method
- To DOT, for visualization (integrate with github/anvaka/word2vec-graph)
- Utilize Postgres as a backend triple store
- The to_csv/from_csv methods do not yet support node attributes.
- Reinforcement learning for graph traversal.
References & Acknowledgements¶
L334: Computational Syntax and Semantics – Introduction to Prolog, Steve Harlow
Open Book Project: Prolog in Python, Chris Meyers
Citing¶
If you use this software, please consider citing:
@software{zincbase,
author = {{Tom Grek}},
title = {ZincBase: A state of the art knowledge base},
url = {https://github.com/tomgrek/zincbase},
version = {0.1.1},
date = {2019-05-12}
}
Contributing¶
See CONTRIBUTING. And please do!
zincbase¶
logic package¶
Submodules¶
logic.Negative module¶
Internal ZincBase class for negative training examples
class logic.Negative.Negative(expr)¶
Bases: object
logic.Term module¶
A base unit for ZincBase’s Prolog-like implementation of ‘facts’
class logic.Term.Term(expr, args=None, graph=None)¶
Bases: object
logic.common module¶
Under-the-hood details of ZincBase’s Prolog-like implementation
logic.common.process(term, bindings, graph=None)¶
logic.common.unify(src, src_bindings, dest, dest_bindings)¶
Module contents¶
nn package¶
Submodules¶
nn.dataloader module¶
class nn.dataloader.BidirectionalOneShotIterator(dataloader_head, dataloader_tail, dataloader_neg=None, neg_ratio=1)¶
Bases: object
ZincBase uses this class automatically when you want to train a model from a KB.
next_no_neg()¶
next_with_neg()¶
static one_shot_iterator(dataloader)¶
class nn.dataloader.NegDataset(neg_triples)¶
Bases: sphinx.ext.autodoc.importer._MockObject
Zincbase sets this up automatically from the knowledge base. It’s a generator used for negative examples.
class nn.dataloader.TrainDataset(triples, nrelation, negative_sample_size, mode)¶
Bases: sphinx.ext.autodoc.importer._MockObject
Zincbase sets this up automatically from the knowledge base. It’s the generator for the RotatE algorithm.
static collate_fn(data)¶
static count_frequency(triples, start=4)¶
static get_true_attr(triples)¶
static get_true_head_and_tail(triples)¶
nn.rotate module¶
class nn.rotate.KGEModel(model_name, nentity, nrelation, hidden_dim, gamma, double_entity_embedding=False, double_relation_embedding=False, node_attributes=[], pred_attributes=[], attr_loss_to_graph_loss=1.0, pred_loss_to_graph_loss=1.0, device='cuda')¶
Bases: sphinx.ext.autodoc.importer._MockObject
ComplEx(head, relation, tail, mode)¶
RotatE(head, relation, tail, mode)¶
forward(sample, mode='single', attributes=True, predict_pred_prop=False, predict_only=False)¶
A single forward pass.
run_embedding(embedding, attribute_name)¶
static train_step(model, optimizer, train_iterator, args)¶
Module contents¶
utils package¶
Submodules¶
utils.calc_auc_roc module¶
Calculate the Area Under the Receiver Operating Characteristic Curve (AUC-ROC), a measure built from the true positive rate and the false positive rate. scikit-learn has not settled on a multiclass implementation; this version is from fbrundu in https://github.com/scikit-learn/scikit-learn/issues/3298.
utils.calc_auc_roc.calc_auc_roc(truth, pred, average='macro')¶
utils.calc_mrr module¶
utils.calc_mrr.calc_mrr(kb, test_file, delimiter=', ', header=None, size=None)¶
Calculate the mean reciprocal rank (MRR) using a test set.
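A sketch of how this might be invoked, assuming a KB with a trained model, that the module is importable as utils.calc_mrr (as documented here), and a tab-delimited test file of triples (the file path below is illustrative, not a bundled asset):
from utils.calc_mrr import calc_mrr
mrr = calc_mrr(kb, './assets/my_test_triples.csv', delimiter='\t')  # hypothetical test file
print(mrr)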
utils.string_utils module¶
utils.string_utils.cleanse(line)¶
utils.string_utils.split_on(line, separator, all=True)¶
utils.string_utils.split_to_parts(line)¶
utils.string_utils.strip_all_whitespace(line)¶
Module contents¶
zincbase package¶
The main Zincbase package.
See README.md for some simple docs.
zincbase.zincbase module¶
class zincbase.zincbase.KB¶
Bases: object
Knowledge Base Class
>>> kb = KB()
>>> kb.__class__
<class 'zincbase.KB'>
add_node_to_trained_kg(sub, pred, ob)¶
attr(node_name, attributes)¶
Set attributes on an existing graph node.
Parameters:
- node_name (str) – Name of the node
- attributes (dict) – Dictionary of attributes to set
Example:
>>> kb = KB()
>>> kb.store('eats(tom, rice)')
0
>>> kb.attr('tom', {'is_person': True})
>>> kb.node('tom')
{'is_person': True}
bfs(start_node, target_node, max_depth=10, reverse=False)¶
Find a path from start_node to target_node.
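A short sketch (not a verified doctest) of searching for a path through stored facts:
>>> kb = KB()
>>> kb.store('knows(tom, shamala)')
0
>>> kb.store('knows(shamala, deepika)')
1
>>> path = kb.bfs('tom', 'deepika', max_depth=3)  # search outward from 'tom' towards 'deepika'
The exact return format of the path is not documented above, so it is not shown here.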
binary_classify(subject, pred, ob)¶
Predict whether the triple (subject, pred, ob) is true or not.
build_kg_model(cuda=False, embedding_size=256, gamma=24, model_name='RotatE', node_attributes=[], attr_loss_to_graph_loss=1.0, pred_loss_to_graph_loss=1.0, pred_attributes=[])¶
Build the dictionaries and the KGE model.
Parameters:
- node_attributes (list) – List of node attributes to include in the model. If a node doesn’t possess the attribute, it is treated as zero. So far, attributes must be floats.
- pred_attributes (list) – List of predicate attributes to include in the model.
- attr_loss_to_graph_loss (float) – How much to weight attribute loss relative to graph loss. 0 takes only graph loss into account; math.inf takes only attribute loss into account.
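For instance, a minimal sketch of including a float node attribute in the model; the attribute name 'gdp' is made up for illustration:
>>> kb = KB()
>>> kb.from_csv('./assets/countries_s1_train.csv', delimiter='\t')
>>> kb.attr('fiji', {'gdp': 0.05})  # hypothetical float attribute; nodes without it are treated as zero
>>> kb.build_kg_model(cuda=False, embedding_size=40, node_attributes=['gdp'])
>>> kb.train_kg_model(steps=500, batch_size=1, verbose=False)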
create_binary_classifier(pred, ob)¶
Creates a binary classifier (SVM) for pred(?, ob) using embeddings from the trained model. Automatically compensates for class imbalance.
Follow it with binary_classify(sub, pred, ob) to predict whether the relation holds or not.
This may be useful because, although the model can estimate a probability for (sub, pred, ob), it is not obvious what threshold to use to decide what constitutes True vs False.
Example:
>>> kb = KB()
>>> kb.seed(555)
>>> kb.from_csv('./assets/countries_s1_train.csv', delimiter='\t')
>>> kb.build_kg_model(cuda=False, embedding_size=100)
>>> kb.train_kg_model(steps=2000, batch_size=1, verbose=False, neg_to_pos=4)
>>> _ = kb.create_binary_classifier('locatedin', 'asia')
>>> kb.binary_classify('india', 'locatedin', 'asia')
True
>>> kb.binary_classify('brazil', 'locatedin', 'asia')
False
create_multi_classifier(pred)¶
Build a classifier (SVM) for a predicate that can classify a subject, given a predicate, into one of the object entities from the KB that has that predicate relation. Automatically compensates for class imbalance.
Example:
>>> kb = KB()
>>> kb.from_csv('./assets/countries_s1_train.csv', delimiter='\t')
>>> kb.seed(555)
>>> kb.build_kg_model(cuda=False, embedding_size=40)
>>> kb.train_kg_model(steps=1000, batch_size=1, verbose=False)
>>> _ = kb.create_multi_classifier('locatedin')
>>> kb.multi_classify('philippines', 'locatedin')
'south_eastern_asia'
delete_edge_attr(sub, pred, ob, attributes)¶
Delete attributes previously set on a predicate between subject and object. To set the attribute in the first place, see also edge_attr.
Parameters:
- sub (str) – Subject node/entity
- pred (str) – Predicate between subject and object
- ob (str) – Object node/entity
- attributes (list) – List of attributes to delete
Returns: False if the attribute was not present, else None.
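A short sketch (not a verified doctest) pairing this with edge_attr, which is documented below:
>>> kb = KB()
>>> kb.store('eats(tom, rice)')
0
>>> kb.edge_attr('tom', 'eats', 'rice', {'used_to': 1.0})
>>> kb.delete_edge_attr('tom', 'eats', 'rice', ['used_to'])  # the attribute is removed again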
delete_rule(rule_idx)¶
Delete a rule from the KB.
Parameters:
- rule_idx – The index of the rule in the KB, returned when the rule was added. May be int (if it was a real rule) or str (if it was a negative example, preceded by ~).
Example:
>>> kb = KB()
>>> kb.store('a(a)')
0
>>> kb.delete_rule(0)
True
edge(sub, pred, ob)¶
Returns an edge and its attributes.
Parameters:
- sub (str) – Subject node/entity
- pred (str) – Predicate between subject and object
- ob (str) – Object node/entity
Example:
>>> kb = KB()
>>> kb.store('eats(tom, rice)')
0
>>> kb.edge_attr('tom', 'eats', 'rice', {'used_to': 1.0})
>>> kb.edge('tom', 'eats', 'rice')
{'used_to': 1.0}
edge_attr(sub, pred, ob, attributes)¶
Set attributes on a predicate between subject and object. Useful, for example, to encode time or truthiness.
Note that if any of the specified attributes have been previously set, this updates them with the new values. To delete a set edge attribute, see also delete_edge_attr.
Parameters:
- sub (str) – Subject node/entity
- pred (str) – Predicate between subject and object
- ob (str) – Object node/entity
- attributes (dict) – Attributes to set on the individual edge. Must be floats.
Example:
>>> kb = KB()
>>> kb.store('eats(tom, rice)')
0
>>> kb.edge_attr('tom', 'eats', 'rice', {'used_to': 1.0})
>>> kb.edge('tom', 'eats', 'rice')
{'used_to': 1.0}
>>> kb.edge_attr('tom', 'eats', 'rice', {'still_does': 1.0})
>>> kb.edge('tom', 'eats', 'rice')
{'used_to': 1.0, 'still_does': 1.0}
entities¶
All the entities in the KB.
Returns generator: Generator of all the entities
estimate_triple_prob(sub, pred, ob)¶
Estimate the probability of the triple (sub, pred, ob) according to the trained model.
estimate_triple_prob_with_attrs(sub, pred, ob, pred_prop)¶
filter(filter_condition, candidate_nodes=None)¶
Filter (i.e. query) nodes by attributes.
Parameters:
- filter_condition (function) – Test function
- candidate_nodes (list) – Nodes to test (optional; defaults to the whole graph)
Example:
>>> kb = KB()
>>> kb.store('person(tom)')
0
>>> kb.attr('tom', {'cats': 0})
>>> list(kb.filter(lambda x: x['cats'] < 1))
['tom']
fit_knn(entities=None)¶
Fit an unsupervised sklearn kNN to the embeddings of entities.
Parameters:
- entities (list) – The entities that should be part of the kNN. Defaults to all entities if not specified.
from_csv(csvfile, header=None, start=0, size=None, delimiter=', ')¶
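from_csv is used throughout the examples in this document. For instance, loading the bundled, tab-delimited Countries triples:
>>> kb = KB()
>>> kb.from_csv('./assets/countries_s1_train.csv', delimiter='\t')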
from_triples(triples)¶
Stores facts from a list of tuples into the KB.
Parameters:
- triples (list) – List of tuples, each of the form (subject, pred, object)
Example:
>>> kb = KB()
>>> kb.from_triples([('b', 'a', 'c')])
>>> len(list(kb.query('a(b, c)')))
1
get_embedding(entity)¶
get_most_likely(sub, pred, ob, candidates=None, k=1)¶
Return the k most likely triples to satisfy the input triple. One of sub, pred, or ob may be ‘?’.
Parameters:
- candidates (list<str>) – Candidate entities/predicates. If None or not specified, this function will generate possible candidates from the rest of the triple.
- k (int) – The k in top k.
Example:
>>> kb = KB()
>>> kb.from_csv('./assets/countries_s1_train.csv', delimiter='\t')
>>> kb.seed(555)
>>> kb.build_kg_model(cuda=False, embedding_size=100)
>>> kb.train_kg_model(steps=2000, batch_size=2, verbose=False, neg_to_pos=4)
>>> kb.get_most_likely('austria', 'neighbor', '?', k=2) # doctest:+ELLIPSIS
[{'prob': 0.9673, 'triple': ('austria', 'neighbor', 'germany')}, {'prob': 0.9656, 'triple': ('austria', 'neighbor', 'liechtenstein')}]
>>> kb.get_most_likely('?', 'neighbor', 'austria', candidates=list(kb.entities), k=2)
[{'prob': 0.9467, 'triple': ('slovenia', 'neighbor', 'austria')}, {'prob': 0.94, 'triple': ('liechtenstein', 'neighbor', 'austria')}]
>>> kb.get_most_likely('austria', '?', 'germany', k=3)
[{'prob': 0.9673, 'triple': ('austria', 'neighbor', 'germany')}, {'prob': 0.664, 'triple': ('austria', 'locatedin', 'germany')}]
get_nearest_neighbors(entity, k=1)¶
Get the nearest neighbors to entity (by embedding), according to the previously fit kNN.
Parameters:
- entity (str) – An entity
- k (int) – How many neighbors
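A sketch combining fit_knn and get_nearest_neighbors. It assumes a KB with a trained KG model (for example the Countries model built in the examples above); the exact return format is not shown:
>>> kb.fit_knn()                                      # fit on all entity embeddings
>>> nearest = kb.get_nearest_neighbors('fiji', k=3)   # 3 closest entities in embedding space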
load_all(dirname='.', cuda=False)¶
Load the KB (and model, if it exists) from the specified directory.
Parameters:
- dirname (str) – Directory from which to load zb.pkl and (if present) pytorch_model.dict
- cuda (bool) – If the model exists, it will be loaded; set this to True if you want it on the GPU.
multi_classify(subject, pred)¶
Predict the object for subject according to the multi-classifier previously trained on pred.
neighbors(node)¶
Return neighbors of node and the predicates that connect them.
Parameters:
- node (str) – Name of the node
Returns: List[(node_name, List[predicate])]
Example:
>>> kb = KB()
>>> kb.store('knows(tom, shamala)')
0
>>> kb.neighbors('tom')
[('shamala', [{'pred': 'knows'}])]
node(node_name)¶
Get a node, and its attributes, from the graph.
Parameters:
- node_name (str) – Name of the node
Returns: The node and its attributes.
Example:
>>> kb = KB()
>>> kb.store('eats(tom, rice)')
0
>>> kb.node('tom')
{}
>>> kb.attr('tom', {'is_person': True})
>>> kb.node('tom')
{'is_person': True}
plot(density=1.0)¶
Plots a network diagram from the (triple) nodes and edges in the KB.
Parameters:
- density (float) – Probability (0-1) that a given edge will be plotted; useful to thin out dense graphs for visualization.
predicates¶
All the predicates (aka relations) in the KB.
Returns generator: Generator of all the predicates
query(statement)¶
Query the KB.
Parameters:
- statement (str) – A rule to query on.
Returns: Generator of alternative bindings to variables that match the query
Example:
>>> kb = KB()
>>> kb.store('a(a)')
0
>>> kb.query('a(X)') #doctest: +ELLIPSIS
<generator object KB._search at 0x...>
>>> list(kb.query('a(X)'))
[{'X': 'a'}]
save_all(dirname='.')¶
Save the current KB to the specified directory. Also saves the (state dict of the) PyTorch model, if it has been built.
Parameters:
- dirname (str) – Directory in which to save the files. Created if it doesn’t already exist.
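A sketch of a save/load round trip, following the behaviour described for save_all and load_all (the directory name is illustrative):
>>> kb.save_all('./my_kb')        # writes zb.pkl, plus the model state dict if a model was built
>>> kb2 = KB()
>>> kb2.load_all('./my_kb', cuda=False)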
seed(seed)¶
Seed the RNGs for PyTorch, NumPy, and Python itself.
Parameters:
- seed (int) – Random seed
Example:
>>> KB().seed(555)
solidify(predicate)¶
Query the KB (with Prolog) and ‘solidify’ the resulting facts, making them part of the graph so that the NN can be trained on them.
Parameters:
- predicate (str) – A predicate (one defined by a rule, not a plain fact; otherwise there is nothing to solidify)
Example:
>>> kb = KB()
>>> kb.store('is(tom, human)')
0
>>> kb.store('has_part(shamala, head)')
1
>>> kb.store('is(X, human) :- has_part(X, head)')
2
>>> next(kb.query('is(tom, human)'))
True
>>> kb.to_triples()
[('tom', 'is', 'human'), ('shamala', 'has_part', 'head')]
>>> kb.solidify('is')
1
>>> kb.to_triples()
[('tom', 'is', 'human'), ('shamala', 'has_part', 'head'), ('shamala', 'is', 'human')]
store(statement, node_attributes=[], edge_attributes={})¶
Store a fact/rule in the KB.
It is possible to store ‘false’ facts (negative examples) by preceding the predicate with a tilde (~). In this case, they do not appear in the graph and cannot be queried, but may assist when building the model.
Parameters:
- statement (str) – Fact or rule to store in the KB.
- node_attributes (list<dict>) – List of length 2, each element being a dict of attributes to set on the nodes (in order: subject, object).
- edge_attributes (dict) – Dictionary of attributes to set on the edge. May include truthiness which, if < 0, automatically makes the rule a negative example.
Returns: the id of the fact/rule
Example:
>>> KB().store('a(a)')
0
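The optional attribute arguments can be supplied at store time. For instance, marking a fact as a negative example via a truthiness edge attribute (see also the Negative Examples section below):
>>> kb = KB()
>>> _ = kb.store('likes(tom, sprouts)', edge_attributes={'truthiness': -1})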
to_tensorboard_projector(embeddings_filename, labels_filename, filter_fn=None)¶
Convert the KB’s trained embeddings to 2 files suitable for https://projector.tensorflow.org. This outputs only entity embeddings, not relation embeddings, a visualization of which may not be interpretable.
Parameters:
- embeddings_filename (str) – Filename to output embeddings to, in TSV format.
- labels_filename (str) – Filename to output labels to, one label per row.
- filter_fn (function) – Only include the embeddings/labels for which filter_fn(label) returns True
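A sketch of exporting a trained model for the projector (the file names are illustrative):
>>> kb.to_tensorboard_projector('entity_embeddings.tsv', 'entity_labels.tsv')
Both files can then be uploaded at https://projector.tensorflow.org.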
to_triples(data=False)¶
Convert all facts in the KB to a list of tuples, each of length 3 (or 7 if data=True). Any fact that is not arity 2 will be ignored.
Note: while the Prolog-style representation uses pred(subject, object), the triple representation is (subject, pred, object).
Parameters:
- data (bool) – Whether to also return subject, predicate, and object attributes as elements 4, 5, and 6 of the tuple. The 7th element is usually False, but is True when the fact/triple is a negative example.
Returns: list of triples (tuples of length 3, or 7 if data=True)
Example:
>>> kb = KB()
>>> kb.store('a(b, c)')
0
>>> kb.to_triples()
[('b', 'a', 'c')]
>>> kb.store('a(a)')
1
>>> kb.to_triples()
[('b', 'a', 'c')]
>>> kb.attr('b', {'an_attribute': 'xyz'})
>>> kb.to_triples()
[('b', 'a', 'c')]
>>> kb.to_triples(data=True)
[('b', 'a', 'c', {'an_attribute': 'xyz'}, {}, {}, False)]
train_kg_model(steps=1000, batch_size=512, lr=0.001, reencode_triples=False, neg_to_pos=128, neg_ratio=1.0, verbose=True)¶
Train a KG model on the KB.
Parameters:
- steps (int) – Number of training steps
- batch_size (int) – Batch size for training
- lr (float) – Initial learning rate for the Adam optimizer
- reencode_triples (bool) – Set this to True if a node has been added since the last training run
- neg_to_pos (int) – Ratio of generated negative samples to real positive samples
- neg_ratio (float) – How often real/inputted negative examples should appear, versus real positives plus generated negatives. Smaller (but > 0) means more often.
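For example, a sketch of continuing training after a new fact/node has been stored (the added fact is made up; reencode_triples=True tells the trainer about the new node, per the parameter description above):
>>> _ = kb.store('locatedin(atlantis, oceania)')  # hypothetical new fact introducing a new node
>>> kb.train_kg_model(steps=100, batch_size=1, reencode_triples=True, verbose=False)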
Negative Examples¶
Negative examples can be added to a ZincBase KB in two ways. Either:
- Prefix a rule with ~, such as ~likes(tom, sprouts)
- Give it a truthiness attribute that’s less than zero.
Concretely, this looks like:
kb.store('~likes(tom, sprouts)')
kb.store('likes(tom, sprouts)', edge_attributes={'truthiness': -1})
Negative examples are fed in to the KG model as part of the usual training regime; you may control the frequency that this happens with the neg_ratio kwarg of KB.train_kg_model.
Note that you can specify truthiness as something you want the model to learn to predict (i.e. specify pred_attributes=['truthiness'] when you call build_kg_model). However, negative truthiness takes the example out of the normal flow: only examples with 0 <= truthiness <= 1 are part of ‘proper’ training where the predicate-attribute prediction is taken into account.
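Concretely, a minimal sketch of that setup, assuming the KB already contains facts carrying a truthiness edge attribute:
kb.build_kg_model(cuda=False, embedding_size=40, pred_attributes=['truthiness'])
kb.train_kg_model(steps=1000, batch_size=1, neg_ratio=0.5, verbose=False)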
Anecdotally, negative examples do not help much, or only help with small datasets.