Reading files with Python
For Python users involved in molecular modeling, reading files is likely a major part of analyzing results. Here I show a few potentially useful ways to approach file handling with Python. This list is in no way exhaustive. It is intended to serve as a brief overview for beginners.
1. Reading in coordinates from an XYZ file
In an earlier post, I discussed the XYZ file format and its use in molecular modeling. There are many reasons why one might need to read in the nuclear coordinates of a molecule from an XYZ file. Suppose you're trying to map a potential energy curve as a function of some internuclear distance, bond angle, or dihderal torsion. You'll want a way to calculate the distance in each structure used to map your curve, that doesn't involve opening a GUI software to measure the distance "by hand." Here is an example of how to open an XYZ file and read in its contents.
----------------------------------------------------
f = input('Enter name of XYZ file: ')
xyz = open(f)
N = int(xyz.readline())
header = xyz.readline()
atom_symbol, coords = ([] for i in range (2))
for line in xyz:
atom,x,y,z = line.split()
atom_symbol.append(atom)
coords.append([float(x),float(y),float(z)])
xyz.close()
----------------------------------------------------
2. Extracting a specific portion of an output file
Now suppose there is are specific quantities you want to extract from the results of an electronic structure calculation. These quantities could be: molecular orbital energies, nuclear coordinates of the optimized structure, atomic partial charges from population analysis, and many others. Sometimes this can be challenging given that output files contain a lot of information and are made up of: text, numbers, and symbols.
Here is an example showing one way to find the optimized cartesian coordinates in a MOPAC output (.out) file. The coordinates always appear after the second instance of the string, 'CARTESIAN COORDINATES'.
----------------------------------------------------
import numpy as np
with open("ouput.out", "r") as f:
content = []
cartesian = 0
for line in f:
if cartesian >= 2:
content.append(line)
if 'CARTESIAN COORDINATES' in line:
cartesian += 1
coords = np.array(content)
f.close()
----------------------------------------------------
In the above example, the array 'coords' can be further manipulated depending on the goal of the analysis.
3. Reading in data from an Excel sheet
Excel is a powerful tool when used correctly. Sometimes it may be useful to use both Python and Excel for analysis. Here is a simple example of how to read in values from specific columns in a chosen sheet of an Excel workbook. For further information, I point the reader to the xlrd documentation.
----------------------------------------------------
import xlrd
book = xlrd.open_workbook('mybook.xlsx')
sheet = book.sheet_by_name('SHEET1')
G = 29
atoms = [[sheet.cell_value(r, c) for c in range(1)] for r in range(2,G+2)]
x_vals =[[sheet.cell_value(r, c) for c in range(1,2)] for r in range(2,G+2)]
----------------------------------------------------
4. Reading in a large number of files
In some instances, one may need to read in a large number of files. It would then be impractical to specify each file name indvidually. Here is an example showing one way to approach reading in a set of .dat files, each containing one vector. For further information, I point the reader to the glob documentation.
----------------------------------------------------
import glob
import os
vec = []
files = sorted(glob.glob('*.dat'))
for file in files:
f = open(file,'r')
for line in f:
vec.extend(line.split())
f.close()
----------------------------------------------------