<!-- PROJECT LOGO -->
<br />
<p align="center">
  <a align="center" href="https://git.cs.uni-paderborn.de/lgehring/lsm.git">
    <img src="pexels-pixabay-373543.jpg" alt="Logo" width="1000" height="200">
  </a>

  <h3 align="center">LSM Group: Knowledge Graphs - Mini Project - Summer Term 2021</h3>

  <p align="left">
    This repository presents our mini-project for the Foundations of Knowledge Graphs lecture at Paderborn University, Germany.
    We were provided with 25 learning problems from the Carcinogenesis dataset, each consisting of included and excluded components. The task was to develop a classifier that determines the carcinogenicity of new components based on these learning problems.
  </p>
</p>



<!-- TABLE OF CONTENTS -->
<details open="open">
  <summary><h2 style="display: inline-block">Table of Contents</h2></summary>
  <ol>
    <li>
      <a href="#approach">Approach</a>
      <ul>
        <li>
          <a href="#prerequisites">Prerequisites</a>
          <ul>
            <li><a href="#libraries">Libraries</a></li>
          </ul>
        </li> 
      </ul>
    </li>
    <li>
      <a href="#getting-started">Getting Started</a>
    </li>
    <li>
    <a href="#usage">Usage</a>
    </li>
    <li><a href="#contact">Contact</a></li>
  </ol>
</details>



<!-- APPROACH -->
## Approach
We decided to use embeddings to represent the Carcinogenesis dataset in a compact form.
The embeddings are generated with the PyKEEN library, which offers a wide range of embedding models
and can be configured with parameters such as the number of epochs or the dimension
of the generated embeddings.

To make predictions from these embeddings, we first used standard machine learning algorithms such as
random forests, logistic regression, and k-nearest neighbors (kNN). In doing so, we found
that many of the learning problems have a highly unbalanced ratio of positive to negative
(included to excluded) instances.
For learning problems with an extremely high proportion of negative (excluded) instances,
the classifiers labeled all instances as negative, since they mostly optimize accuracy
rather than the F1 score.
To overcome this problem, we tried to balance the training data before training. Since undersampling
with very few positive instances leaves a very small training set,
we decided to oversample instead. The oversampling algorithm we used is the SMOTE implementation from the scikit-learn extension
imbalanced-learn (https://github.com/scikit-learn-contrib/imbalanced-learn). In simple terms, SMOTE computes
new synthetic data points for the smaller class, each of which lies on the line segment between two data points of that class.
Using this technique and a linear SVM, we were able to at least slightly reduce the bias toward the negative class.
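
The interpolation idea behind SMOTE can be sketched in a few lines of plain Python. This is a simplified illustration only, not the imbalanced-learn implementation we used: the function name `smote_like_oversample`, the toy 2-dimensional points, and the choice of `k=2` are assumptions made for the example.

```python
import math
import random


def smote_like_oversample(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between minority-class neighbors.

    Simplified illustration of the SMOTE idea (hypothetical helper, not library code).
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        # Pick a random minority-class point as the starting point.
        base = rng.choice(minority)
        # Find its k nearest neighbors within the minority class (excluding itself).
        others = [p for p in minority if p is not base]
        neighbors = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbor = rng.choice(neighbors)
        # Interpolate: the new point lies on the segment between base and neighbor.
        t = rng.random()
        synthetic.append([b + t * (n - b) for b, n in zip(base, neighbor)])
    return synthetic


# Toy 2-dimensional "embeddings" of three positive (minority) instances.
positives = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
new_points = smote_like_oversample(positives, n_new=5)
print(len(new_points))  # number of synthetic positives that were generated
```

Every synthetic point lies between two real positive instances, so the minority class grows without duplicating existing rows exactly.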

With this setup we achieved F1 scores ranging from \<lower_bound> up to \<higher_bound> for the given test learning problems.
We split the data into training and test sets in a ratio of \<ratio>.
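
To illustrate why we report the F1 score rather than accuracy on unbalanced learning problems, the following sketch compares both measures on a made-up example. The class counts (5 positive, 95 negative) and both sets of predictions are hypothetical and are not results from this project.

```python
def accuracy_and_f1(y_true, y_pred):
    """Return (accuracy, F1) for binary labels, where 1 marks the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    # F1 = 2*TP / (2*TP + FP + FN); treated as 0 when the denominator is 0.
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return accuracy, f1


# Hypothetical learning problem: 5 positive and 95 negative instances.
y_true = [1] * 5 + [0] * 95

# Classifier A labels everything as negative.
pred_all_negative = [0] * 100
acc_a, f1_a = accuracy_and_f1(y_true, pred_all_negative)

# Classifier B finds 4 of the 5 positives but also raises 10 false alarms.
pred_balanced = [1, 1, 1, 1, 0] + [1] * 10 + [0] * 85
acc_b, f1_b = accuracy_and_f1(y_true, pred_balanced)

print(f"A: accuracy={acc_a:.2f}, F1={f1_a:.2f}")
print(f"B: accuracy={acc_b:.2f}, F1={f1_b:.2f}")
```

Classifier A has the higher accuracy even though it never finds a positive instance, while the F1 score ranks classifier B higher.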


<!-- PREREQUISITES -->
### Prerequisites

* Python (3.8.10)??? 
* pip3 (20.0.2)

<!-- LIBRARIES -->
#### Libraries
* pykeen (1.5.0)
* seaborn (0.11.1)
* pandas (1.1.0)
* matplotlib (3.3.0)
* numpy (1.19.1)
* rdflib (5.0.0)
* scikit_learn (0.24.2)
* imbalanced-learn
* scipy (>=0.19.1)
* joblib (>=0.11)


<!-- GETTING STARTED -->
## Getting Started
<!-- INSTALLATION-OF-PREREQUISITES -->
### Installation of Prerequisites

1. Install Python (> 3.6.9), for example:
  ```sh
  sudo apt install python3.8
  ```
2. Clone the repo
  ```sh
  git clone https://git.cs.uni-paderborn.de/lgehring/lsm.git
  ```
3. Install PyKEEN
  ```sh
  pip3 install pykeen
  ```
4. Install Seaborn
  ```sh
  pip3 install seaborn
  ```
5. Install Pandas
  ```sh
  pip3 install pandas
  ```
6. Install Matplotlib
  ```sh
  pip3 install matplotlib
  ```
7. Install NumPy
  ```sh
  pip3 install numpy
  ```
8. Install RDFLib
  ```sh
  pip3 install rdflib
  ```
9. Install scikit-learn
  ```sh
  pip3 install scikit-learn
  ```
10. Install [TODO]
  ```sh
  pip3 install [TODO]
  ```


<!-- USAGE -->
## Usage
<!-- BUILD-EMBEDDINGS -->
### Build Embeddings
#### Command
To build the embeddings, use the file "pykeen_embeddings.py". After importing it,
embeddings can be generated via the generate_and_save_embeddings([...]) function. Alternatively,
the file can be run on its own to generate embeddings:
  ```sh
  python pykeen_embeddings.py [-h] OWL_File Triple_File Model_Name Dimension Target_File
  ```
  Positional arguments:
  * OWL_File:     File containing the graph to embed, in OWL format.
  * Triple_File:  Path to a .tsv file in which to store the graph triples.
  * Model_Name:   Model to use for the embeddings, e.g. TransR or TransE. The model must be supported by PyKEEN.
  * Dimension:    Target dimension of the embeddings.
  * Target_File:  File to save the generated embeddings in. Should be of type .tsv.
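
As a rough sketch of how a command-line interface with these five positional arguments can be declared with `argparse`: the code below is an assumption for illustration and is not copied from "pykeen_embeddings.py". In particular, parsing `Dimension` as an integer, the help texts, and the example file names are hypothetical.

```python
import argparse


def build_parser():
    """Declare the five positional arguments listed above (illustrative sketch)."""
    parser = argparse.ArgumentParser(
        prog="pykeen_embeddings.py",
        description="Generate knowledge graph embeddings and save them to a file.",
    )
    parser.add_argument("OWL_File", help="File containing the graph to embed, in OWL format.")
    parser.add_argument("Triple_File", help="Path to a .tsv file for the graph triples.")
    parser.add_argument("Model_Name", help="Embedding model, e.g. TransR or TransE.")
    # Assumption: the dimension is converted to an integer when parsing.
    parser.add_argument("Dimension", type=int, help="Target dimension of the embeddings.")
    parser.add_argument("Target_File", help="Output .tsv file for the embeddings.")
    return parser


# Hypothetical invocation; the file names are placeholders.
args = build_parser().parse_args(
    ["carcinogenesis.owl", "triples.tsv", "TransE", "32", "embeddings.tsv"]
)
print(args.Model_Name, args.Dimension)
```
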
#### Output
Writes the embeddings to Target_File, one entity per line, in the format \<entity_name>\\t\<embedding>:\
[...]\
http://dl-learner.org/carcinogenesis#Carbon-232	[-0.31581488, 0.8278251, 1.8727547, -0.041085266, 0.014574329, -0.67168427, 2.0599394, 0.8136213, 0.6940255, -0.46925345, 0.08348891, 2.1310937, 0.48579377, 0.8463386, -0.6833736, -0.5553089, -2.0224242, -1.2030263, 0.6667486, -0.6958244, 2.0900614, 1.4396154, -1.1683109, -0.86062014, -0.65684366, 0.37350407, 1.6208519, 0.8363482, 0.0754768, 0.23654181, -1.4851497, 0.88049734]\
http://dl-learner.org/carcinogenesis#Carbon-26	[-0.103435785, -1.2094345, -0.18882215, 2.0368776, 0.3304364, -1.7264694, -0.38451058, 0.06835548, -1.3024201, 0.16077128, -0.6984507, -0.29645622, 0.021067962, 1.4021096, 1.9172877, -2.2997203, 1.0408328, 0.24595535, -0.0757225, 0.41191146, -0.24012361, -1.6148175, -0.9519527, -0.0012898605, -0.24245678, 0.5220458, 0.28011653, 0.27396503, -0.09945937, 1.8605173, -1.373711, -1.4735564]\
http://dl-learner.org/carcinogenesis#Carbon-27	[-1.8802489, 2.185924, -0.7223453, -1.0277753, 1.2828372, -1.8145577, 0.041590724, -0.24165802, -0.5704698, 0.93525743, -0.9134435, 0.8481486, 0.46955204, -0.47266957, -2.4214704, -0.6310501, -1.1237596, -2.3589735, 0.37650838, 1.8736081, -0.9354778, -0.65831023, -1.2054998, 1.0181395, 0.5560374, -0.12456948, 0.40127212, -0.046274118, -1.456181, 1.7935433, -0.41356027, 0.081598125]\
[...]
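
A file in this format can be read back with the standard library alone. The sketch below assumes that each line holds an entity name, a tab, and a bracketed, comma-separated list of floats, as in the sample above; the helper names `parse_embedding_line` and `load_embeddings` are hypothetical, and the vectors are shortened to three values for brevity.

```python
import json


def parse_embedding_line(line):
    """Split one '<entity_name>\\t<embedding>' line into (name, list of floats)."""
    entity, vector_text = line.rstrip("\n").split("\t", 1)
    # The bracketed, comma-separated vector is valid JSON, so json.loads can parse it.
    return entity, json.loads(vector_text)


def load_embeddings(lines):
    """Build a dict mapping entity names to embedding vectors, skipping blank lines."""
    return dict(parse_embedding_line(line) for line in lines if line.strip())


# Two sample lines in the output format, truncated to three dimensions.
sample_lines = [
    "http://dl-learner.org/carcinogenesis#Carbon-232\t[-0.31581488, 0.8278251, 1.8727547]\n",
    "http://dl-learner.org/carcinogenesis#Carbon-26\t[-0.103435785, -1.2094345, -0.18882215]\n",
]
embeddings = load_embeddings(sample_lines)
print(len(embeddings))
```

For a real file, the same function can be called on an open file handle, for example `load_embeddings(open(path, encoding="utf-8"))`, where `path` is whatever was passed as Target_File.
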
### PCA Plot Embeddings (Optional)
#### Command
  ```sh
  pip3 install [TODO]
  ```
#### Output
* [TODO] List output
<!-- RUN-CLASSIFIER -->
### Run classifier
#### Command
  ```sh
  pip3 install [TODO]
  ```
#### Output
* [TODO] List output

#### [TODO]
#### Command
  ```sh
  pip3 install [TODO]
  ```
#### Output
* [TODO] List output


<!-- CONTACT -->
## Contact

* Lukas Gehring - lgehring - 7082490 - lgehring@mail.uni-paderborn.de
* Sven Meyer    - svemey98 - 7133064 - svemey98@mail.uni-paderborn.de
* Markus Röse   - mroese - 7087673 - mroese@mail.uni-paderborn.de
* Mohness Waizy - waizy - 7120556 - waizy@mail.upb.de

Project Link: [https://git.cs.uni-paderborn.de/lgehring/lsm.git](https://git.cs.uni-paderborn.de/lgehring/lsm.git)