Commit 5c2c1c2a authored by markus

Clean up and update README

parent 535d446c
@@ -2,7 +2,7 @@
<br />
<p align="center">
<a align="center" href="https://git.cs.uni-paderborn.de/lgehring/lsm.git">
<img src="pexels-pixabay-373543.jpg" alt="Logo" width="1000" height="200">
<img src="img/pexels-pixabay-373543.jpg" alt="Logo" width="1000" height="200">
</a>
<h3 align="center">LSM Group: Knowledge Graphs - Mini Project - Summer Term 2021</h3>
@@ -35,6 +35,9 @@ This repository represents our work regarding the mini-project for the Foundatio
<li>
<a href="#usage">Usage</a>
</li>
<li>
<a href="#other-approaches">Other Approaches</a>
</li>
<li><a href="#contact">Contact</a></li>
</ol>
</details>
@@ -64,17 +67,8 @@ imbalanced-learn (https://github.com/scikit-learn-contrib/imbalanced-learn). In
new synthetic data points for the smaller class, each of which lies on the line segment between two existing data points of this class.
Using this technique and a Linear SVM, we were able to at least partially mitigate the overweighting of the negative class.
Using this we could achieve F1-Scores ranging from \<lower_bound> up to \<higher_bound> for the given test lps.
We split the data into learning and test in a ratio of \<ratio>.
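For illustration, a minimal sketch of this resampling idea using imbalanced-learn and scikit-learn (the synthetic dataset, split, and untuned parameters below are stand-ins, not the repository's actual code):
```python
# Hedged sketch: oversample the minority class with SMOTE, then fit a Linear SVM.
# The synthetic imbalanced dataset is a placeholder for the real entity features.
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=1000, weights=[0.95], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

# SMOTE synthesizes new minority-class points on the line segments between
# existing minority samples and their nearest minority-class neighbors.
X_res, y_res = SMOTE(random_state=0).fit_resample(X_train, y_train)

clf = LinearSVC().fit(X_res, y_res)
print("F1:", f1_score(y_test, clf.predict(X_test)))
```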
## Other approaches
We tried out several different approaches to tackle the given task of classifying entities. These approaches can be found
in the folder "other_approaches" as Jupyter notebooks.
### SKLEARN Clustering
In the notebook "dbscan_clustering.ipynb" we explored the possibility of using clustering algorithms from sklearn to classify the given entities. We chose DBSCAN, as SKLEARN states that it works well with imbalanced datasets. Unfortunately, the approach did not yield good results and was therefore not pursued further.
### PyTorch Geometric Graph Neural Network
A second approach was the implementation of a graph neural network using the library pytorch_geometric, i.e. a deep learning approach. The idea was to use a graph neural network to classify entities based on the labels of the learning problems and the edges of the knowledge graph. The first step was to fit the network on the training data using CrossEntropyLoss as the loss function and then classify all individuals (including the ones used for training). The network computes a probability distribution over the labels for each individual, and each individual is assigned to the class with the highest probability. However, since the data are heavily imbalanced, all individuals were assigned to the negative (excluded) class, so the F1-score was not meaningful. We were unable to find a solution to this problem, hence this approach was not pursued further.
<!--Using this we could achieve F1-Scores ranging from \<lower_bound> up to \<higher_bound> for the given test lps.
We split the data into learning and test in a ratio of \<ratio>.-->
<!-- PREREQUISITES -->
### Prerequisites
@@ -127,12 +121,18 @@ the file can be run on its own to generate embeddings:
```sh
python pykeen_embeddings.py [-h] OWL_File Triple_File Model_Name Dimension Target_File
```
positional arguments:
* OWL_File: File containing the graph to embed, in OWL format.
* Triple_File: Path to a .tsv file in which to store the graph triples.
* Model_Name: Model to use for embeddings, e.g. TransR or TransE. Model must be supported by PyKeen.
* Dimension: Target dimension of embeddings.
* Target_File: File to save generated embeddings in. Should be of type .tsv.
<b>Example call</b>
```sh
python pykeen_embeddings.py data/carcinogenesis.owl t.tsv TransE 2 out.tsv
```
<b>Positional Arguments:</b>
* OWL_File: <i>File containing the graph to embed, in OWL format.</i>
* Triple_File: <i>Path to a ".tsv" file in which to store the graph triples.</i>
* Model_Name: <i>Model to use for embeddings, e.g. TransR or TransE. Model must be supported by PyKeen.</i>
* Dimension: <i>Target dimension of embeddings.</i>
* Target_File: <i>File to save generated embeddings in. Should be of type ".tsv".</i>
<b>NOTE</b>: Depending on the hardware used to generate the embeddings, this can take quite a bit of time.
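For illustration, a condensed sketch of what such a script might do internally, using rdflib and the PyKEEN pipeline API (file names and hyperparameters mirror the example call above; this is an assumption-laden sketch rather than the repository's actual implementation, and accessor names can vary between PyKEEN versions):
```python
# Hedged sketch: convert an OWL file to a .tsv triple file, train TransE with
# PyKEEN, and dump the entity embeddings. Not the repository's exact code.
import rdflib
from pykeen.pipeline import pipeline
from pykeen.triples import TriplesFactory

# Extract (head, relation, tail) triples from the OWL graph.
g = rdflib.Graph().parse("data/carcinogenesis.owl", format="xml")  # assumes RDF/XML
with open("t.tsv", "w") as f:
    for s, p, o in g:
        f.write(f"{s}\t{p}\t{o}\n")

triples = TriplesFactory.from_path("t.tsv")
result = pipeline(
    training=triples,
    testing=triples,  # sketch only; a real run should hold out test triples
    model="TransE",
    model_kwargs=dict(embedding_dim=2),
)

# Row i of the embedding matrix belongs to the entity with id i
# (the exact accessor may differ across PyKEEN versions).
emb = result.model.entity_representations[0](indices=None).detach().cpu().numpy()
with open("out.tsv", "w") as f:
    for entity, idx in triples.entity_to_id.items():
        f.write(f"{entity}\t{list(emb[idx])}\n")
```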
#### Output
Outputs embeddings into Target_File in the format \<entity_name>\\t\<embedding>:\
[...]\
@@ -150,6 +150,14 @@ http://dl-learner.org/carcinogenesis#Carbon-27 [-1.8802489, 2.185924, -0.7223453
#### Output
Outputs predictions for all learning problems in Turtle syntax to predictions.ttl.
## Other Approaches
We tried out several different approaches to tackle the given task of classifying entities. These approaches can be found
in the folder "other_approaches" as Jupyter notebooks.
### SKLearn Clustering
In the notebook "dbscan_clustering.ipynb" we explored the possibility of using clustering algorithms from SKLearn to classify the given entities. We chose DBSCAN, as SKLearn states that it works well with imbalanced datasets. Unfortunately, the approach did not yield good results and was therefore not pursued further.
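A minimal sketch of that idea, assuming entity embeddings serve as the feature matrix (the random data and the eps/min_samples values below are illustrative placeholders, not the notebook's settings):
```python
# Hedged sketch: cluster entity embeddings with DBSCAN and inspect the result.
# Random points stand in for real embeddings; eps/min_samples are untuned.
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))  # placeholder for entity embeddings

labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)
print("clusters found:", sorted(set(labels)))  # -1 marks noise points
```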
### PyTorch Geometric Graph Neural Network
A second approach was the implementation of a graph neural network using the library pytorch_geometric, i.e. a deep learning approach. The idea was to use a graph neural network to classify entities based on the labels of the learning problems and the edges of the knowledge graph. The first step was to fit the network on the training data using CrossEntropyLoss as the loss function and then classify all individuals (including the ones used for training). The network computes a probability distribution over the labels for each individual, and each individual is assigned to the class with the highest probability. However, since the data are heavily imbalanced, all individuals were assigned to the negative (excluded) class, so the F1-score was not meaningful. We were unable to find a solution to this problem, hence this approach was not pursued further.
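For illustration, a minimal two-layer GCN for node classification in pytorch_geometric, trained with CrossEntropyLoss as described above (the toy graph, feature sizes, and architecture are assumptions, not the notebook's actual model):
```python
# Hedged sketch: two-layer GCN for node classification, trained with
# CrossEntropyLoss. Graph, features, and labels below are toy placeholders.
import torch
import torch.nn.functional as F
from torch_geometric.data import Data
from torch_geometric.nn import GCNConv

# Toy graph: 4 nodes, undirected edges listed in both directions.
edge_index = torch.tensor([[0, 1, 1, 2, 2, 3],
                           [1, 0, 2, 1, 3, 2]])
x = torch.randn(4, 8)           # node features (e.g. entity embeddings)
y = torch.tensor([0, 0, 0, 1])  # imbalanced binary labels
data = Data(x=x, edge_index=edge_index, y=y)

class GCN(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = GCNConv(8, 16)
        self.conv2 = GCNConv(16, 2)

    def forward(self, data):
        h = F.relu(self.conv1(data.x, data.edge_index))
        return self.conv2(h, data.edge_index)

model = GCN()
opt = torch.optim.Adam(model.parameters(), lr=0.01)
loss_fn = torch.nn.CrossEntropyLoss()  # a `weight` tensor here can counter imbalance

model.train()
for _ in range(100):
    opt.zero_grad()
    loss = loss_fn(model(data), data.y)
    loss.backward()
    opt.step()

pred = model(data).argmax(dim=1)  # assign each node its most probable class
```
Passing a `weight` tensor to CrossEntropyLoss is one standard way to counteract the class imbalance described above.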
<!-- CONTACT -->
## Contact