Commit 4051e835 authored by Jayesh

updated A2W2T1

parent f68704ec
%% Cell type:code id: tags:
``` python
%pip install scikit-learn
```
%% Output
Collecting scikit-learn
Downloading scikit_learn-1.5.0-cp310-cp310-win_amd64.whl.metadata (11 kB)
Requirement already satisfied: scikit-learn in e:\dsse\dsse-group-7\.venv\lib\site-packages (1.5.0)
Requirement already satisfied: numpy>=1.19.5 in e:\dsse\dsse-group-7\.venv\lib\site-packages (from scikit-learn) (1.26.1)
Requirement already satisfied: scipy>=1.6.0 in e:\dsse\dsse-group-7\.venv\lib\site-packages (from scikit-learn) (1.13.0)
Requirement already satisfied: scipy>=1.6.0 in e:\dsse\dsse-group-7\.venv\lib\site-packages (from scikit-learn) (1.12.0)
Requirement already satisfied: joblib>=1.2.0 in e:\dsse\dsse-group-7\.venv\lib\site-packages (from scikit-learn) (1.4.2)
Collecting threadpoolctl>=3.1.0 (from scikit-learn)
Downloading threadpoolctl-3.5.0-py3-none-any.whl.metadata (13 kB)
Downloading scikit_learn-1.5.0-cp310-cp310-win_amd64.whl (11.0 MB)
Downloading threadpoolctl-3.5.0-py3-none-any.whl (18 kB)
Installing collected packages: threadpoolctl, scikit-learn
Successfully installed scikit-learn-1.5.0 threadpoolctl-3.5.0
Note: you may need to restart the kernel to use updated packages.
Requirement already satisfied: threadpoolctl>=3.1.0 in e:\dsse\dsse-group-7\.venv\lib\site-packages (from scikit-learn) (3.5.0)
%% Cell type:code id: tags:
``` python
import ast

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Load the cleaned issue data from Week 1.
issue_data_df = pd.read_excel("E:\\DSSE\\DSSE-Group-7\\Assignment_2\\Week1\\yarn_issue_data_cleaned.xlsx")
# Each cell is assumed to hold a stringified Python list; parse it safely (no eval) and join the tokens.
issue_data_df['Summary_Description_Tokens_Str'] = issue_data_df['Summary_Description_Tokens'].apply(lambda x: ' '.join(ast.literal_eval(x)))
print(issue_data_df[['Issue key', 'Summary_Description_Tokens_Str']].head(10))
# Drop issues whose token string is empty.
issue_data_df = issue_data_df[issue_data_df['Summary_Description_Tokens_Str'].str.strip() != '']

# Count terms per issue, ignoring terms in more than 95% of issues or in fewer than 2 issues.
vectorizer = CountVectorizer(max_df=0.95, min_df=2, stop_words='english')
dtm = vectorizer.fit_transform(issue_data_df['Summary_Description_Tokens_Str'])
# Convert to a dense table and label each row with its issue key.
dtm_df = pd.DataFrame(dtm.toarray(), columns=vectorizer.get_feature_names_out())
dtm_df.insert(0, 'Issue key', issue_data_df['Issue key'].values)

output_path = "E:\\DSSE\\DSSE-Group-7\\Assignment_2\\Week 2\\yarn_document_term_matrix.csv"
dtm_df.to_csv(output_path, index=False)
print(f"Document-term matrix saved to: {output_path}")
```
%% Output
    Issue key                     Summary_Description_Tokens_Str
0  YARN-10930  introduce universal configure capacity vector ...
1  YARN-10562  follow change yarn yarn race condition directo...
2  YARN-10514  introduce dominant resource base schedule poli...
3  YARN-10494  cli tool dockertosquashfs conversion pure java...
4  YARN-10493  runc container repository v current runc conta...
5  YARN-10344  sync netty version hadoopyarncsi nettyall fina...
6  YARN-10335  improve schedule container base node health ya...
7  YARN-10238  clone hadoopsls simulate huge scale yarn yarn ...
8  YARN-10225  support amd rocm gpus yarn hi watch seem hop s...
9  YARN-10071  sync mockito version module yarn introduce moc...
Document-term matrix saved to: E:\DSSE\DSSE-Group-7\Assignment_2\Week 2\yarn_document_term_matrix.csv