Commit 535d446c authored by markus

Merge branch 'main' of https://git.cs.uni-paderborn.de/lgehring/lsm into main

parents 1f7665e5 c15252b4
<!-- PROJECT LOGO -->
<br />
<p align="center">
@@ -9,8 +8,7 @@
<h3 align="center">LSM Group: Knowledge Graphs - Mini Project - Summer Term 2021</h3>
<p align="left">
This repository represents our work on the mini-project for the Foundations of Knowledge Graphs lecture at Paderborn University in Germany. We were provided with 25 learning problems from the Carcinogenesis dataset, each having included and excluded components. The task was to develop a classifier that can determine the carcinogenicity of new components based on the learning problems from the Carcinogenesis dataset.
</p>
</p>
@@ -45,28 +43,38 @@
<!-- APPROACH -->
## Approach
We decided to use embeddings to represent the Carcinogenesis dataset in an efficient form.
This was done using the PyKeen library, which offers a myriad of different embedding models.
Further, it can be configured with different parameters such as the number of epochs or the dimension
of the generated embedding. In our tests, the embedding model "TransR" worked best with our approach.
TransR is a translation-based approach similar to TransE, with the addition that it represents relations
and entities in different vector spaces, thereby increasing the spatial distance between instances.
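For illustration, training such an embedding with PyKeen can be sketched as follows; the triples file name, the split, and the hyperparameters are placeholders, not our exact configuration:
```python
# Sketch only: file name, split, and hyperparameters are illustrative.
from pykeen.pipeline import pipeline
from pykeen.triples import TriplesFactory

# Load the knowledge graph as (head, relation, tail) triples
tf = TriplesFactory.from_path("carcinogenesis_triples.tsv")
training, testing = tf.split([0.95, 0.05], random_state=42)

result = pipeline(
    training=training,
    testing=testing,
    model="TransR",                        # embeds entities and relations in separate spaces
    model_kwargs=dict(embedding_dim=64),   # dimension of the generated embedding
    training_kwargs=dict(num_epochs=100),  # number of epochs
)
result.save_to_directory("transr_64dim")   # persist model and embeddings
```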
To make predictions using these embeddings, we first used typical machine learning algorithms such as
random forests, logistic regression, and nearest-neighbour classifiers like kNN. In doing so, we encountered
the problem that many of the learning problems have a very unbalanced ratio of positive and negative
(included and excluded) instances.
For learning problems with an extremely high proportion of negative (excluded) instances,
the classification algorithms classified all instances as negative, since these algorithms mostly optimize
accuracy instead of the F1 score.
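A small, made-up example illustrates the effect: a classifier that always predicts the negative class reaches high accuracy but an F1 score of zero.
```python
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical learning problem: 5 included vs. 95 excluded instances
y_true = [1] * 5 + [0] * 95
y_pred = [0] * 100  # degenerate classifier: everything is "excluded"

print(accuracy_score(y_true, y_pred))             # 0.95 -- looks deceptively good
print(f1_score(y_true, y_pred, zero_division=0))  # 0.0  -- worthless for the positive class
```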
To overcome this problem, we tried to balance the training data before training. Since undersampling
with a very small number of positive instances leads to a very small training data set,
we decided to oversample instead. The oversampling algorithm we used is the SMOTE implementation of the sklearn extension
imbalanced-learn (https://github.com/scikit-learn-contrib/imbalanced-learn). In simple terms, SMOTE calculates
new synthetic data points for the smaller class, each of which lies on the line between two data points of this class.
Using this technique and a linear SVM, we were able to at least slightly mitigate the overweighting of the negative class,
achieving F1 scores ranging from \<lower_bound> up to \<higher_bound> for the given test LPs.
We split the data into training and test sets in a ratio of \<ratio>.
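The core of this pipeline can be sketched as follows; the synthetic data stands in for the real embedding features and labels:
```python
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

# Stand-in for the real data: in our setting, X holds the TransR embedding
# vectors of an LP's entities and y marks included (1) / excluded (0)
X, y = make_classification(n_samples=1000, n_features=64, weights=[0.95], random_state=42)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Oversample the minority class with synthetic SMOTE points before fitting
X_train, y_train = SMOTE(random_state=0).fit_resample(X_train, y_train)

clf = LinearSVC(random_state=1).fit(X_train, y_train)
print(f1_score(y_test, clf.predict(X_test)))
```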
## Other approaches
We tried out several different approaches to tackle the given task of classifying entities. These approaches can be found
in the folder "other_approaches" as Jupyter notebooks.
### SKLEARN Clustering
In the notebook "dbscan_clustering.ipynb" we explored the possibility of using clustering algorithms from sklearn to classify the given entities. We chose DBSCAN, as sklearn states that it works well with imbalanced datasets. Unfortunately, the approach did not yield good results and was therefore not pursued further.
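A minimal sketch of the idea, with random stand-in vectors instead of the actual embeddings:
```python
import numpy as np
from sklearn.cluster import DBSCAN

# Stand-in for the entity embeddings of one learning problem
X = np.random.rand(500, 16)

# DBSCAN assigns a cluster label per point; -1 marks noise
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)
print(np.unique(labels, return_counts=True))
```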
### PyTorch Geometric Graph Neural Network
A second approach was the implementation of a graph neural network from the library pytorch_geometric, i.e. a deep learning approach. The idea was to use a graph neural network for classification based on the labels of the learning problems and the edges of the knowledge graph. The first step was to fit the network using the training data with CrossEntropyLoss as the metric and, after that, to classify all individuals (even the ones used for training). The network computes a probability distribution over the labels for each individual, and each individual is assigned to the class with the highest probability. However, since the data is very imbalanced, all individuals were assigned to the negative (excluded) class and the F1 score was not very meaningful. Unfortunately, we could not find a solution to this problem, so we did not pursue this approach further.
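For illustration, a minimal node-classification network of this kind could look as follows (the toy graph, feature sizes, and training loop are placeholders, not the notebook's actual setup):
```python
import torch
import torch.nn.functional as F
from torch_geometric.nn import GCNConv

class GNN(torch.nn.Module):
    """Two-layer graph convolutional network for node classification."""
    def __init__(self, in_dim: int, num_classes: int):
        super().__init__()
        self.conv1 = GCNConv(in_dim, 32)
        self.conv2 = GCNConv(32, num_classes)

    def forward(self, x, edge_index):
        x = F.relu(self.conv1(x, edge_index))
        return self.conv2(x, edge_index)  # raw logits per node

# Toy graph: 4 nodes with 16-dim features and a few directed edges
x = torch.randn(4, 16)
edge_index = torch.tensor([[0, 1, 2, 3], [1, 0, 3, 2]])
y = torch.tensor([0, 0, 0, 1])  # imbalanced node labels

model = GNN(16, 2)
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
for _ in range(100):
    optimizer.zero_grad()
    loss = F.cross_entropy(model(x, edge_index), y)  # CrossEntropyLoss
    loss.backward()
    optimizer.step()

# Each node is assigned the class with the highest predicted probability
pred = model(x, edge_index).argmax(dim=1)
```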
<!-- PREREQUISITES -->
### Prerequisites
@@ -97,44 +105,17 @@ We split the data into learning and test in a ratio of \<ratio>.
```sh
sudo apt install python3.8
```
2. Clone the repo
```sh
git clone https://git.cs.uni-paderborn.de/lgehring/lsm.git
```
3. Install required libraries
```sh
pip3 install -r requirements.txt
```
<!-- USAGE -->
## Usage
<!-- BUILD-EMBEDDINGS -->
@@ -159,38 +140,24 @@ http://dl-learner.org/carcinogenesis#Carbon-232 [-0.31581488, 0.8278251, 1.87275
http://dl-learner.org/carcinogenesis#Carbon-26 [-0.103435785, -1.2094345, -0.18882215, 2.0368776, 0.3304364, -1.7264694, -0.38451058, 0.06835548, -1.3024201, 0.16077128, -0.6984507, -0.29645622, 0.021067962, 1.4021096, 1.9172877, -2.2997203, 1.0408328, 0.24595535, -0.0757225, 0.41191146, -0.24012361, -1.6148175, -0.9519527, -0.0012898605, -0.24245678, 0.5220458, 0.28011653, 0.27396503, -0.09945937, 1.8605173, -1.373711, -1.4735564]\
http://dl-learner.org/carcinogenesis#Carbon-27 [-1.8802489, 2.185924, -0.7223453, -1.0277753, 1.2828372, -1.8145577, 0.041590724, -0.24165802, -0.5704698, 0.93525743, -0.9134435, 0.8481486, 0.46955204, -0.47266957, -2.4214704, -0.6310501, -1.1237596, -2.3589735, 0.37650838, 1.8736081, -0.9354778, -0.65831023, -1.2054998, 1.0181395, 0.5560374, -0.12456948, 0.40127212, -0.046274118, -1.456181, 1.7935433, -0.41356027, 0.081598125]\
[...]
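The repository's helper load_embeddings_from_file reads such a file back into a dictionary; a rough sketch of that parsing, assuming each line holds the entity URI followed by its vector as shown above (this is not the helper's actual implementation):
```python
import json
import numpy as np

# Sketch of parsing the embedding file shown above
def load_embeddings(path: str) -> dict:
    embeddings = {}
    with open(path) as f:
        for line in f:
            if not line.strip():
                continue
            uri, vector = line.strip().split(maxsplit=1)  # URI, then "[v1, v2, ...]"
            embeddings[uri] = np.array(json.loads(vector))
    return embeddings
```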
<!-- RUN-CLASSIFIER -->
### Run classifier
#### Command
```sh
python approach.py
```
#### Output
Outputs predictions for all learning problems in Turtle syntax into predictions.ttl
<!-- CONTACT -->
## Contact
* Lukas Gehring - lgehring - 7082490 - lgehring@mail.uni-paderborn.de
* Sven Meyer - svemey98 - 7133064 - svemey98@mail.uni-paderborn.de
* Markus Röse - mroese - 7087673 - mroese@mail.uni-paderborn.de
* Mohness Waizy - waizy - 7120556 - waizy@mail.uni-paderborn.de
Project Link: [https://git.cs.uni-paderborn.de/lgehring/lsm.git](https://git.cs.uni-paderborn.de/lgehring/lsm.git)
from rdflib import Graph, Literal, Namespace, XSD, URIRef
import sys
from sklearn.svm import LinearSVC
from pykeen_embeddings import load_embeddings_from_file
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score
from imblearn.over_sampling import SMOTE


def read(path, format):
@@ -42,10 +42,8 @@ def extract_resources(g: Graph, lp: int):
    return included_res, excluded_res


def generate_features(included_res: list, excluded_res: list,
                      path="data/embeddings/embeddings_carcinogenesis_transr_64dim.tsv"):
    # Build a labelled feature matrix from the entity embeddings
    embeddings = load_embeddings_from_file(embeddings_file=path)
    df = pd.DataFrame.from_dict(data=embeddings, orient='index')
@@ -58,46 +56,82 @@
    return df.iloc[:, :-1], df['y']


def get_to_classify(included_res: list, excluded_res: list,
                    path="data/embeddings/embeddings_carcinogenesis_transr_64dim.tsv"):
    # Collect the embeddings of every entity that is not part of the learning problem
    embeddings = load_embeddings_from_file(embeddings_file=path)
    entity_names = open("data/all_entities.txt", "r").read().split("\n")
    fin = {}
    all_given = included_res + excluded_res
    for k in embeddings:
        if k not in all_given:
            fin[k.split('#')[-1]] = embeddings[k]
    return fin


def __main__(path='data/kg-mini-project-grading.ttl'):
    g = read(path=path, format='turtle')
    data = {}
    for lp in range(33, 50):
        print(f"start lp {lp}")
        included_res, excluded_res = extract_resources(g, lp=lp)
        X_train, y_train = generate_features(included_res, excluded_res)
        # Balance the training data by oversampling the minority class with SMOTE
        ros = SMOTE(random_state=0)
        X_train, y_train = ros.fit_resample(X_train, y_train)
        clf = LinearSVC(random_state=1).fit(X_train, y_train)
        # Classify all remaining entities of the knowledge graph
        embeddings = get_to_classify(included_res, excluded_res)
        X_pred = list(embeddings.values())
        predictions = clf.predict(X_pred)
        pred_dict = {}
        for x, y in zip(embeddings.keys(), predictions):
            pred_dict[x] = y
        pred_included = [k for k, v in pred_dict.items() if v == 1]
        pred_excluded = [k for k, v in pred_dict.items() if v == 0]
        data[lp] = (pred_included, pred_excluded)
        print(f"finished lp {lp}")
    create_rdf(data)


def create_rdf(data):
    # create a Graph
    g = Graph()

    # prefixes
    CARCINOGENESIS = Namespace("http://dl-learner.org/carcinogenesis#")
    LPRES = Namespace("https://lpbenchgen.org/resource/")
    LPPROP = Namespace("https://lpbenchgen.org/property/")
    g.bind("carcinogenesis", CARCINOGENESIS)
    g.bind("lpres", LPRES)
    g.bind("lpprop", LPPROP)

    for lp, (included, excluded) in data.items():
        # included resources
        g.add((LPRES[f"result_{lp}pos"], LPPROP.belongsToLP, Literal("true", datatype=XSD.boolean)))
        g.add((LPRES[f"result_{lp}pos"], LPPROP.pertainsTo, LPRES[f"lp_{lp}"]))
        for res in included:
            g.add((LPRES[f"result_{lp}pos"], LPPROP.resource, CARCINOGENESIS[res]))

        # excluded resources
        g.add((LPRES[f"result_{lp}neg"], LPPROP.belongsToLP, Literal("false", datatype=XSD.boolean)))
        g.add((LPRES[f"result_{lp}neg"], LPPROP.pertainsTo, LPRES[f"lp_{lp}"]))
        for res in excluded:
            g.add((LPRES[f"result_{lp}neg"], LPPROP.resource, CARCINOGENESIS[res]))

    g.serialize(destination='predictions.ttl', format='turtle')


__main__()
@@ -26,7 +26,7 @@
"cells": [
{
"cell_type": "code",
"execution_count": 20,
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
@@ -129,50 +129,402 @@
},
{
"cell_type": "code",
"execution_count": 21,
"execution_count": 7,
"metadata": {},
"outputs": [],
"source": [
"learning_problems = read(path='../data/kg-mini-project-train_v2.ttl', format='turtle')\n",
"embeddings = load_embeddings_from_file(\"../data/embeddings/embeddings_carcinogenesis_transr_16dim.tsv\")\n",
"\n",
"# Combine positive and negative results\n",
"pos, neg = extract_resources(learning_problems, 2)\n",
"all_res = pos + neg\n",
"all_res_embeddings = np.array([embeddings[x] for x in all_res])\n",
"def execute_on_lp(lps : Graph, embeddings : list, lp : int):\n",
" # Combine positive and negative results\n",
" pos, neg = extract_resources(lps, lp)\n",
" all_res = pos + neg\n",
" all_res_embeddings = np.array([embeddings[x] for x in all_res])\n",
"\n",
"clustering = KMeans(n_clusters=2, random_state=0).fit(all_res_embeddings)"
" clustering = KMeans(n_clusters=2, random_state=0).fit(all_res_embeddings)\n",
" # Identify cluster with majority positive examples\n",
" positives_label = 1 if sum(clustering.labels_[:len(pos)]) > len(pos)/2 else 0\n",
" negatives_label = 1 - 1 * positives_label\n",
" print(\"Positive cluster label: \" + str(positives_label))\n",
" print(\"Negative cluster label: \" + str(negatives_label))\n",
" TP = list(clustering.labels_[:len(pos)]).count(positives_label)\n",
" print(\"True positives: \" + str(TP))\n",
" TN = list(clustering.labels_[len(pos):]).count(negatives_label)\n",
" print(\"True negatives: \" + str(TN))\n",
" FP = list(clustering.labels_[len(pos):]).count(positives_label)\n",
" print(\"False positives: \" + str(FP))\n",
" FN = list(clustering.labels_[:len(pos)]).count(negatives_label)\n",
" print(\"False negatives: \" + str(FN))\n",
" print_metrics(TP, TN, FP, FN)"
]
},
{
"cell_type": "code",
"execution_count": 22,
"execution_count": 8,
"metadata": {},
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"Positive cluster label: 0\nNegative cluster label: 1\nTrue positives: 78\nTrue negatives: 10979\nFalse positives: 11246\nFalse negatives: 69\nAccuracy: 0.4942338637582693\nPrecision: 0.00688802543270929\nRecall: 0.5306122448979592\nF1-Score: 0.013599511812396478\n"
"Execution for LP 1: \n",
"Positive cluster label: 0\n",
"Negative cluster label: 1\n",
"True positives: 107\n",
"True negatives: 10714\n",
"False positives: 11551\n",
"False negatives: 0\n",
"Accuracy: 0.4836849633470409\n",
"Precision: 0.009178246697546749\n",
"Recall: 1.0\n",
"F1-Score: 0.018189545261368466\n",
"\n",
"\n",
"\n",
"Execution for LP 2: \n",
"Positive cluster label: 1\n",
"Negative cluster label: 0\n",
"True positives: 76\n",
"True negatives: 10646\n",
"False positives: 11579\n",
"False negatives: 71\n",
"Accuracy: 0.4792597890219918\n",
"Precision: 0.006520806520806521\n",
"Recall: 0.5170068027210885\n",
"F1-Score: 0.012879173021521775\n",
"\n",
"\n",
"\n",
"Execution for LP 3: \n",
"Positive cluster label: 0\n",
"Negative cluster label: 1\n",
"True positives: 504\n",
"True negatives: 10708\n",
"False positives: 11154\n",
"False negatives: 6\n",
"Accuracy: 0.5011621669944574\n",
"Precision: 0.04323211528564076\n",
"Recall: 0.9882352941176471\n",
"F1-Score: 0.08284023668639054\n",
"\n",
"\n",
"\n",
"Execution for LP 4: \n",
"Positive cluster label: 0\n",
"Negative cluster label: 1\n",
"True positives: 25\n",
"True negatives: 10712\n",
"False positives: 11635\n",
"False negatives: 0\n",
"Accuracy: 0.47993026998033256\n",
"Precision: 0.002144082332761578\n",
"Recall: 1.0\n",
"F1-Score: 0.004278990158322636\n",
"\n",
"\n",
"\n",
"Execution for LP 5: \n",
"Positive cluster label: 0\n",
"Negative cluster label: 1\n",
"True positives: 767\n",
"True negatives: 10703\n",
"False positives: 10889\n",
"False negatives: 13\n",
"Accuracy: 0.5126944394779188\n",
"Precision: 0.06580301990391214\n",
"Recall: 0.9833333333333333\n",
"F1-Score: 0.1233515599871341\n",
"\n",
"\n",
"\n",
"Execution for LP 6: \n",
"Positive cluster label: 1\n",
"Negative cluster label: 0\n",
"True positives: 101\n",
"True negatives: 10639\n",
"False positives: 11560\n",
"False negatives: 72\n",
"Accuracy: 0.4800643661720007\n",
"Precision: 0.008661349798473545\n",
"Recall: 0.5838150289017341\n",
"F1-Score: 0.01706946087544364\n",
"\n",
"\n",
"\n",
"Execution for LP 7: \n",
"Positive cluster label: 0\n",
"Negative cluster label: 1\n",
"True positives: 214\n",
"True negatives: 10715\n",
"False positives: 11443\n",
"False negatives: 0\n",
"Accuracy: 0.48851242624709457\n",
"Precision: 0.018358068113579824\n",
"Recall: 1.0\n",
"F1-Score: 0.036054249852581925\n",
"\n",
"\n",
"\n",
"Execution for LP 8: \n",
"Positive cluster label: 1\n",
"Negative cluster label: 0\n",
"True positives: 96\n",
"True negatives: 10718\n",
"False positives: 11558\n",
"False negatives: 0\n",
"Accuracy: 0.4833720722331486\n",
"Precision: 0.008237515016303414\n",
"Recall: 1.0\n",
"F1-Score: 0.016340425531914893\n",
"\n",
"\n",
"\n",
"Execution for LP 9: \n",
"Positive cluster label: 0\n",
"Negative cluster label: 1\n",
"True positives: 97\n",
"True negatives: 10711\n",
"False positives: 11564\n",
"False negatives: 0\n",
"Accuracy: 0.4831038798498123\n",
"Precision: 0.008318326044078552\n",
"Recall: 1.0\n",
"F1-Score: 0.016499404660656577\n",
"\n",
"\n",
"\n",
"Execution for LP 10: \n",
"Positive cluster label: 1\n",
"Negative cluster label: 0\n",
"True positives: 312\n",
"True negatives: 10703\n",
"False positives: 11350\n",
"False negatives: 7\n",
"Accuracy: 0.49235651707491507\n",
"Precision: 0.026753558566283656\n",
"Recall: 0.9780564263322884\n",
"F1-Score: 0.052082463901176865\n",
"\n",
"\n",
"\n",
"Execution for LP 11: \n",
"Positive cluster label: 1\n",
"Negative cluster label: 0\n",
"True positives: 44\n",
"True negatives: 11616\n",
"False positives: 10669\n",
"False negatives: 43\n",
"Accuracy: 0.5211871982835687\n",
"Precision: 0.004107159525809764\n",
"Recall: 0.5057471264367817\n",
"F1-Score: 0.008148148148148147\n",
"\n",
"\n",
"\n",
"Execution for LP 12: \n",
"Positive cluster label: 1\n",
"Negative cluster label: 0\n",
"True positives: 126\n",
"True negatives: 10710\n",
"False positives: 11535\n",
"False negatives: 1\n",
"Accuracy: 0.48435544430538174\n",
"Precision: 0.010805248263442244\n",
"Recall: 0.9921259842519685\n",
"F1-Score: 0.02137767220902613\n",
"\n",
"\n",
"\n",
"Execution for LP 13: \n",
"Positive cluster label: 0\n",
"Negative cluster label: 1\n",
"True positives: 18\n",
"True negatives: 10716\n",
"False positives: 11638\n",
"False negatives: 0\n",
"Accuracy: 0.4797961737886644\n",
"Precision: 0.0015442690459849004\n",
"Recall: 1.0\n",
"F1-Score: 0.0030837759122837073\n",
"\n",
"\n",
"\n",
"Execution for LP 14: \n",
"Positive cluster label: 0\n",
"Negative cluster label: 1\n",
"True positives: 44\n",
"True negatives: 10716\n",
"False positives: 11610\n",
"False negatives: 2\n",
"Accuracy: 0.4809583407831218\n",
"Precision: 0.003775527715805732\n",
"Recall: 0.9565217391304348\n",
"F1-Score: 0.007521367521367521\n",
"\n",
"\n",
"\n",
"Execution for LP 15: \n",
"Positive cluster label: 0\n",
"Negative cluster label: 1\n",
"True positives: 338\n",
"True negatives: 11634\n",
"False positives: 10383\n",
"False negatives: 17\n",
"Accuracy: 0.5351332022170571\n",
"Precision: 0.03152690980319\n",
"Recall: 0.952112676056338\n",
"F1-Score: 0.06103286384976526\n",
"\n",
"\n",
"\n",
"Execution for LP 16: \n",
"Positive cluster label: 0\n",
"Negative cluster label: 1\n",
"True positives: 15\n",
"True negatives: 10713\n",
"False positives: 11644\n",
"False negatives: 0\n",
"Accuracy: 0.4795279814053281\n",
"Precision: 0.0012865597392572262\n",
"Recall: 1.0\n",
"F1-Score: 0.002569813260236423\n",
"\n",
"\n",
"\n",
"Execution for LP 17: \n",
"Positive cluster label: 0\n",
"Negative cluster label: 1\n",
"True positives: 1936\n",
"True negatives: 9131\n",
"False positives: 9715\n",
"False negatives: 1590\n",
"Accuracy: 0.4946808510638298\n",
"Precision: 0.1661659943352502\n",
"Recall: 0.5490640952921158\n",
"F1-Score: 0.2551228833102721\n",
"\n",
"\n",
"\n",
"Execution for LP 18: \n",
"Positive cluster label: 1\n",
"Negative cluster label: 0\n",
"True positives: 321\n",
"True negatives: 10716\n",
"False positives: 11335\n",
"False negatives: 0\n",
"Accuracy: 0.4933398891471482\n",
"Precision: 0.02753946465339739\n",
"Recall: 1.0\n",
"F1-Score: 0.05360273858228271\n",
"\n",
"\n",
"\n",
"Execution for LP 19: \n",
"Positive cluster label: 1\n",
"Negative cluster label: 0\n",
"True positives: 42\n",
"True negatives: 10710\n",
"False positives: 11620\n",
"False negatives: 0\n",
"Accuracy: 0.4806007509386733\n",
"Precision: 0.003601440576230492\n",
"Recall: 1.0\n",
"F1-Score: 0.007177033492822966\n",
"\n",
"\n",
"\n",
"Execution for LP 20: \n",
"Positive cluster label: 0\n",
"Negative cluster label: 1\n",
"True positives: 632\n",
"True negatives: 10708\n",
"False positives: 11023\n",
"False negatives: 9\n",
"Accuracy: 0.5068836045056321\n",
"Precision: 0.05422565422565422\n",
"Recall: 0.9859594383775351\n",
"F1-Score: 0.10279765777488614\n",
"\n",
"\n",
"\n",
"Execution for LP 21: \n",
"Positive cluster label: 0\n",
"Negative cluster label: 1\n",
"True positives: 26\n",