Date: 2021-12-09 09:10

CHEN E4580 Artificial Intelligence in Chemical Engineering Fall 2021

Final Project Due Date: Dec 15, 2021

The final project for the course carries 60% of your final grade. You can choose to do the
problem below, or an individual problem related to chemical engineering with prior
permission from the instructor or TA. If you choose a separate problem, it has to be
complex enough to justify the grade. In either case, the team/group that you are
assigned to remains the same.

Your final grade will be based on the report submitted for the project. This report should
be in the form of a research article with a title, list of authors, an abstract, introduction,
methods, results and discussion, conclusions that highlight the major findings of your
work, and a list of references. There are no constraints in terms of the number of pages
or the word limit but try not to exceed 10 pages (excluding references). Excessive and
unnecessary verbosity will be penalized.

You need to submit your code as separate .py or .ipynb files. A major part of the project
is to apply the methods learnt during the course and demonstrate a strong knowledge of
at least a couple of them (including visualization techniques).

Problem Statement: Thermodynamic Property Prediction

Computational or data-driven property prediction is an important task that helps in
minimizing the experimentation effort required to measure and estimate molecular
properties. Many molecules are either difficult to procure or expensive to synthesize under
normal laboratory conditions. In addition, property measurement experiments are often
challenging. This necessitates the development of property estimation tools that can
predict such properties. These tools are data-driven: they use existing databases of
chemical properties to learn the underlying relationships between molecular features and
physical properties.

One way to build such tools is to train (separate) regression models that relate the
molecular property descriptors or features to their property values. Such models would
take as input the molecular descriptors and give as output the predicted value of the
property of interest.

Molecular Representation

In order to predict molecular properties using data-driven algorithms, we need a way to
represent the molecules numerically so that they could be fed to a regression algorithm.
This could be done in several ways but two common approaches are group contribution
and mol2vec-based methods.

1. Group Contribution: Molecules are often characterized by the presence of several
different functional groups, which can be used to capture the chemical and structural
characteristics that determine their thermodynamic properties. The idea behind group
contribution features is to represent the frequency of occurrence (or absence) of each
functional group in a molecule as a vector. These vectors could be used as features in
regression models trained for the property prediction task. A recent work that used
these representations for the property prediction task is available here (paper,
codebase). The data that has been given to you is primarily based on this work and
reading this paper is strongly recommended. Molecules are represented using their
SMILES string representations (paper).

2. Mol2vec: Mol2vec is an unsupervised machine learning approach to learn vector
representations of molecular substructures. The idea is primarily based on the
word2vec¹ algorithm that is used for learning vector representations of English words in
natural language processing. The mol2vec algorithm learns vector representations such
that chemically related molecular substructures point in similar directions. The research
article where mol2vec was first published is posted on the course website and could also
be accessed here.

Since mol2vec requires a cheminformatics package called RDKit to be installed,
you first need to install RDKit using the following command:

> pip install rdkit-pypi

In order to understand the implementation details of mol2vec, you can go through
the mol2vec documentation. The mol2vec package could be installed using the
following command:

> pip install git+https://github.com/samoturk/mol2vec

You are given a pre-trained mol2vec model (model_300dim.pkl) that could be used
to generate features from SMILES strings. The following code snippet shows how to
generate features using mol2vec.

> import numpy as np
> import pandas as pd
> from rdkit import Chem
> from gensim.models import word2vec
> from mol2vec.features import mol2alt_sentence, MolSentence, DfVec, sentences2vec

> # read the dataset as a pandas dataframe with only the first two columns
> mdf = pd.read_csv('example_dataset.csv').iloc[:, :2]
> # rename the columns to 'smiles' and 'target'
> mdf.columns = ['smiles', 'target']
> # change data type to object (required to store molecule objects in the dataframe later)
> mdf = mdf.astype(object)
> # look at the first few rows in the dataframe
> mdf.head()

> # use rdkit to convert smiles strings to 'molecule' objects and store in a new column 'mol'
> mdf['mol'] = mdf['smiles'].apply(lambda x: Chem.MolFromSmiles(x))
> # load the pre-trained mol2vec model. This file is provided to you separately
> model = word2vec.Word2Vec.load('model_300dim.pkl')
> # create a new column 'sentence' to store 'molecular sentences'
> mdf['sentence'] = None

> for i in range(mdf.shape[0]):
>     # the following 'try except' code block is required to skip
>     # erroneous molecules that could not be processed by rdkit
>     try:
>         m = mdf.loc[i, 'mol']
>         mdf.loc[i, 'sentence'] = MolSentence(mol2alt_sentence(m, 1))
>     except Exception:
>         # an error occurred while processing this molecule: mark it as missing
>         mdf.loc[i, 'sentence'] = None
>         print('skipped: {}'.format(mdf.loc[i, 'smiles']))

> mdf.dropna(inplace=True)  # drop rows with missing values (including skipped molecules)
> mdf.head()

> # generate vector representations of molecules using 'molecular sentences'
> # and the pre-trained mol2vec model
> mdf['mol2vec'] = [DfVec(x) for x in sentences2vec(mdf['sentence'], model, unseen='UNK')]
> X = np.array([x.vec for x in mdf['mol2vec']])  # feature matrix
> y = mdf['target'].values.astype(np.float32)  # target values

¹ You can get a very intuitive understanding of the word2vec algorithm here.

3. Other representations: There are several other representations for molecules
such as the SMILES grammar-based representations (paper, codebase). In
addition, Mordred (paper, codebase), a freely available molecular descriptor
calculator, provides an exhaustive set of features that could be used for property
prediction tasks. You may go through these papers and use them (even partially) in
your work and contrast their advantages/disadvantages with the mol2vec and group
contribution representations.
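
As a minimal sketch of the group contribution idea in item 1, the snippet below turns per-molecule functional-group counts into fixed-length vectors. The four-group vocabulary and the hand-written counts are illustrative assumptions; this is not the featurization code from the referenced work, and the provided datasets already contain precomputed group features.

```python
# Toy group-contribution featurization (illustrative only).
# GROUPS is a hypothetical, tiny vocabulary chosen for this example.
GROUPS = ["CH3", "CH2", "OH", "COOH"]


def group_count_vector(group_counts, groups=GROUPS):
    """Return the occurrence count of each group, in a fixed order.

    Groups that do not occur in the molecule get a count of 0.
    """
    unknown = set(group_counts) - set(groups)
    if unknown:
        raise ValueError(f"groups not in vocabulary: {sorted(unknown)}")
    return [group_counts.get(g, 0) for g in groups]


# Hand-counted groups for three small molecules (assumed decomposition).
molecules = {
    "CCO": {"CH3": 1, "CH2": 1, "OH": 1},      # ethanol
    "CCCO": {"CH3": 1, "CH2": 2, "OH": 1},     # 1-propanol
    "CC(=O)O": {"CH3": 1, "COOH": 1},          # acetic acid
}

features = {smiles: group_count_vector(c) for smiles, c in molecules.items()}
for smiles, vec in features.items():
    print(smiles, vec)
```

Every molecule maps to a vector of the same length, so the rows can be stacked into a feature matrix regardless of molecule size.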

Visualization

Irrespective of the type of molecular representation that you decide to work with, visualize
a subset of molecules to see if chemically similar molecules are closer to each other.
Ideally, the molecular representations should capture the underlying similarities and
differences between molecules to the maximum possible extent. Since you have
high-dimensional features, you may use the t-SNE technique to project the features to two
dimensions and then visualize them using a scatter plot.

More information on this method could be found here. The research article that proposed
t-SNE has been added to the project folder.
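
One possible way to do this projection is sketched below, assuming scikit-learn is installed. The feature matrix is synthetic (two artificial clusters standing in for two chemical classes), and the perplexity value and the `plot_embedding` helper are assumptions made for the example rather than recommended settings.

```python
import numpy as np
from sklearn.manifold import TSNE

# Synthetic stand-in for a feature matrix: two clusters of 20 "molecules"
# in 50 dimensions. Replace with your own features (e.g. mol2vec vectors).
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=0.0, scale=1.0, size=(20, 50)),
    rng.normal(loc=5.0, scale=1.0, size=(20, 50)),
])
labels = np.array([0] * 20 + [1] * 20)  # hypothetical chemical classes

# perplexity must be smaller than the number of samples.
tsne = TSNE(n_components=2, perplexity=10, init="pca", random_state=0)
X_2d = tsne.fit_transform(X)
print(X_2d.shape)


def plot_embedding(points, classes, path="tsne.png"):
    """Save a scatter plot of the 2-D projection, colored by class."""
    import matplotlib
    matplotlib.use("Agg")  # render to a file; no display needed
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    ax.scatter(points[:, 0], points[:, 1], c=classes, cmap="coolwarm", s=20)
    ax.set_xlabel("t-SNE 1")
    ax.set_ylabel("t-SNE 2")
    fig.savefig(path, dpi=150)
    plt.close(fig)
```

Calling `plot_embedding(X_2d, labels)` writes the scatter plot to `tsne.png`. Distances between well-separated clusters in a t-SNE plot are not directly interpretable, so the plot is best read for local neighborhood structure.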

Regression

The property prediction problem could be modeled as a regression task where you may
try different regression methods covered in class such as linear regression, polynomial
regression, support vector regression, decision trees, random forests, k-NN, and so on.
Do not forget to split the given data into train, test, and validation sets, perform appropriate
regularization, tune model hyperparameters, and transform the data using kernels, if
required.
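
The split-and-tune workflow can be sketched as follows, using only the standard library. The data are synthetic, the 70/15/15 split and the candidate values of k are arbitrary assumptions, and the small k-NN regressor is a stand-in for whichever method you choose.

```python
import random


def split_indices(n, val_frac=0.15, test_frac=0.15, seed=0):
    """Shuffle 0..n-1 and cut into train/validation/test index lists."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_test = round(n * test_frac)
    n_val = round(n * val_frac)
    return idx[n_test + n_val:], idx[n_test:n_test + n_val], idx[:n_test]


def knn_predict(x, X_train, y_train, k):
    """Average the targets of the k nearest training points."""
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(x, xt)), yt)
        for xt, yt in zip(X_train, y_train)
    )
    return sum(yt for _, yt in dists[:k]) / k


def rmse(X_eval, y_eval, X_train, y_train, k):
    """Root-mean-square error of k-NN predictions on an evaluation set."""
    errs = [(knn_predict(x, X_train, y_train, k) - y) ** 2
            for x, y in zip(X_eval, y_eval)]
    return (sum(errs) / len(errs)) ** 0.5


def take(data, ids):
    """Select the rows of `data` listed in `ids`."""
    return [data[i] for i in ids]


# Synthetic data: y = 2*x0 - x1 + noise (stand-in for features and property).
rng = random.Random(1)
X = [[rng.uniform(0, 1), rng.uniform(0, 1)] for _ in range(200)]
y = [2 * a - b + rng.gauss(0, 0.05) for a, b in X]

train, val, test = split_indices(len(X))
X_tr, y_tr = take(X, train), take(y, train)

# Tune k on the validation set only; the test set is used once at the end.
val_scores = {k: rmse(take(X, val), take(y, val), X_tr, y_tr, k)
              for k in (1, 3, 5, 9, 15)}
best_k = min(val_scores, key=val_scores.get)
test_rmse = rmse(take(X, test), take(y, test), X_tr, y_tr, best_k)
print(f"best k = {best_k}, test RMSE = {test_rmse:.3f}")
```

Keeping the test set out of the tuning loop is the point of the three-way split: the reported test error then estimates performance on unseen molecules.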

Report the final model performance, including all the metrics for the regression task
that are needed to capture the performance of your model.
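
The assignment does not name specific metrics. As an assumption, the sketch below computes three common choices (MAE, RMSE, and the coefficient of determination R²) from true and predicted values; the numbers in the example call are made up.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return MAE, RMSE and R^2 for paired true/predicted values."""
    n = len(y_true)
    if n == 0 or n != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and equal in length")
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(r) for r in residuals) / n
    ss_res = sum(r * r for r in residuals)          # residual sum of squares
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    r2 = 1.0 - ss_res / ss_tot if ss_tot > 0 else float("nan")
    return {"MAE": mae, "RMSE": math.sqrt(ss_res / n), "R2": r2}


metrics = regression_metrics([1.0, 2.0, 3.0, 4.0], [1.5, 2.0, 2.5, 4.0])
print(metrics)
```

MAE and RMSE are in the units of the property, while R² is dimensionless, so reporting them together gives both an absolute and a relative view of the error.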

Dataset description

Each team is given two datasets containing data on two different properties that need to
be predicted. In each dataset, the first column contains the SMILES strings of molecules,
the second contains the true property values, and the remaining 424 columns are the
group contribution-based features. These 424 features encode the presence of different
functional groups and their frequency of occurrence in a given molecule. You may use all
of these columns as features while training the group contribution-based regression
models.
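
Given that column layout, separating SMILES strings, targets, and group features could look like the sketch below. It assumes pandas and NumPy are installed; the small dataframe is a made-up stand-in with three group columns instead of 424, and the column names are hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for one dataset file: SMILES, target, then group counts.
df = pd.DataFrame({
    "smiles": ["CCO", "CCCO", "CC(=O)O"],
    "target": [1.0, 2.0, 3.0],   # arbitrary placeholder property values
    "g1": [1, 1, 1],
    "g2": [1, 2, 0],
    "g3": [1, 1, 0],
})

smiles = df.iloc[:, 0]                             # first column: SMILES strings
y = df.iloc[:, 1].to_numpy(dtype=np.float32)       # second column: property values
X_gc = df.iloc[:, 2:].to_numpy(dtype=np.float32)   # remaining columns: group features

print(X_gc.shape, y.shape)
```

Selecting by position with `iloc` avoids depending on the header names in the provided files.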

In order to generate the features based on other representations (such as mol2vec), you
need to use the SMILES strings of molecules and pass them to the mol2vec model as
explained in the section on mol2vec above.
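
To make the mol2vec step more concrete: as described in the mol2vec documentation, `sentences2vec` builds a molecule vector by summing the vectors of the substructure identifiers in its molecular sentence, and with `unseen='UNK'` identifiers missing from the vocabulary are replaced by the vector of the `'UNK'` token. The sketch below mimics that idea with a hypothetical three-dimensional embedding table. The identifiers and numbers are made up, and the code does not call mol2vec itself.

```python
# Hypothetical embedding table: substructure identifier -> vector.
# Vectors from the provided model_300dim.pkl would have 300 dimensions.
EMBEDDINGS = {
    "2246728737": [0.5, -1.0, 0.0],
    "864662311": [1.5, 0.25, -0.5],
    "UNK": [0.0, 0.5, 0.5],  # shared vector for identifiers not in the vocabulary
}


def sentence_to_vector(sentence, embeddings=EMBEDDINGS, unseen="UNK"):
    """Sum the vectors of all identifiers in a molecular sentence."""
    total = [0.0] * len(embeddings[unseen])
    for identifier in sentence:
        vec = embeddings.get(identifier, embeddings[unseen])
        total = [t + v for t, v in zip(total, vec)]
    return total


sentence = ["2246728737", "864662311", "999999"]  # last identifier is unseen
molecule_vector = sentence_to_vector(sentence)
print(molecule_vector)
```

Because the result is a plain sum, every molecule gets a vector of the same dimension no matter how many substructures it contains.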

Note that, for each of the properties, you need to train at least two separate models: one
using group contribution representations and the other using any other choice of
molecular representation.

Provided files

Here is an overview of the files you are provided with:

S.No.  Filename                 Description
1      teams.pdf                Project teams-assignment
2      property-assignment.pdf  Property assignment for each team
3      datasets                 Datasets for all groups. You just need to work with
                                the dataset assigned to your team
4      model_300dim.pkl         Pre-trained mol2vec model
5      papers                   Folder containing all the relevant research articles

The following commands would be useful while reading/writing the dataset files:

numpy.loadtxt()              # for loading text files
pandas.read_csv()            # for loading csv files as pandas dataframes
pandas.read_excel()          # for loading excel files as pandas dataframes
pandas.DataFrame.to_csv()    # write dataframe to csv file
pandas.DataFrame.to_excel()  # write dataframe to excel file
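
A short round trip illustrates the CSV calls, assuming pandas is installed. An in-memory buffer is used here so the example needs no files on disk; passing a file path instead works the same way. The dataframe contents are placeholders.

```python
import io

import pandas as pd

df = pd.DataFrame({"smiles": ["CCO", "CCCO"], "target": [1.0, 2.0]})

buffer = io.StringIO()
df.to_csv(buffer, index=False)  # index=False: do not write the row index as a column
buffer.seek(0)                  # rewind before reading back
df_back = pd.read_csv(buffer)

print(df_back)
```

Note that `to_csv` is a method of the dataframe object, while `read_csv` is a function of the `pandas` module.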

