The CPU penalty of VL packing is very low. Actually the new code is faster in some situations because the memory is accessed in a very regular manner, and because the memory bandwidth is reduced. The logic for packing/unpacking is also OpenMP aware.
This development follows the suggestion by Millard Alexander to exploit the sparsity of the coupling arrays, as it is already done within the HIBRIDON package.
This mesh-based algorithm is presently available for ITYPE=3,4 only. It is automatically disabled for other ITYPE values.
The algorithm switches automatically between in-core and out-of-core mode.
The in-core mode requires NRDIST*N*(N+1) storage, where NRDIST is the number of distances in the mesh. This is to be compared to the conventional VL storage which requires MXLAM*(N*(N+1)/2, or less if packed VL storage is enabled. Despite its memory requirements, the in-core mode is very efficient, because i) the in-core data is produced by a large DGEMM, and ii) because the amount of data read at each integration step is limited to 2*N*(N+1), reducing significantly the memory bandwidth requirements in OpenMP operation.
The out-of-core mode dumps all the NRDIST*N*(N+1) storage to disk, and reads back 2*N*(N+1) data per integration step (corresponding to the W matrices and corresponding second order derivative values for the current distance interval). Apart some I/O buffer space, the VL storage is completely suppressed. This should permit to run very large problems on machines with limited memory, including GPUs. As the computational cost of each integration step scales as N3 while the I/Os scale as N2, the I/O overhead should not be a significant limitation for such very large problems.
A less I/O demanding intermediate mode could be easily implemented using an additional storage of 2*N*(N+1) real*8 words corresponding to the W matrices and its second order derivative for the current distance interval. This storage would permit to reduce the reading by a factor of two or more when integration steps are smaller or comparable to the RDIST distance intervals because i) no reading would be needed for a subsequent integration step in the same RDIST interval, and ii) the reading of N*(N+1) data only would be required to refresh the W in-core ansatz for a subsequent integration step in an adjacent RDIST interval. However if the MOLSCAT storage is set sufficiently below the available computer memory, the disk caching provided implicitely by modern system kernels is expected to provide most benefits of this intermediate mode.
Note that the W array is generally not sparse, because it is a weighted sum of the coupling matrices corresponding to all the expansion terms of the potential. While these individual coupling matrices are sparse, their zeros are not expected to match. Consequently there is no benefit for combining the mesh-based and the packed approaches, and VLTHRESH is ignored.
Note that there is no provision for short-range nor long-range interpolations. The behavior of subroutine FINDRM is slightly altered to attempt to keep RMIN larger than RDIST(1). Potential or W values for R<RDIST(1) are blindly extrapolated to their values at R=RDIST(1). Potential or W values for R>RDIST(NRDIST) are blindly extrapolated to 0. This behavior might be inconvenient for problems involving very long-range effects such as ultra-cold collisions.
If POTENL is already spline-interpolated over a distance mesh, this mesh can provide a good startup. It is generally necessary to add a few distances at shorter and larger range to account for the short-range and long-range extrapolations provided by POTENL, if any.
export OMP_NUM_THREADS=1
export OMP_NUM_THREADS=2 # number of OpenMP threads
export OMP_SCHEDULE="DYNAMIC,32" # scheduling strategy
ulimit -s unlimited
$ molscat.x 10000000
The code assumes that the offset (in real*8 words) between the allocated vector and X vector in /MEMORY/ fits in an integer.
To enable the run-time workspace allocation, recompile main.f with PARAMETER (MXDIM=10) and make sure that the call to alloc_memory is uncommented in driver.f.
Modified files: set6c.f, chkstr.f, driver.f. Added routines:
alloc_memory and alloc_vm. See comments in main.f and driver.f for
more details.
Utility routines
A large set of utility routines is provided to suit most architectures.
gclock_f95.f provides elapsed cpu time in seconds
gclock_mpi.f wallclock MPI variant
gclock_dummy.f dummy version
gelapsed.f provides elapsed wall time in seconds
timing_.c C-coded utility routine called by gelapsed.f
gelapsed_mpi.f wallclock MPI portable variant
gelapsed_dummy.f dummy version
gdate_f95.f provides date or time as a character string
gdate_dummy.f corresponding dummy version
flshfo.f flush output buffer, calls flush library routine
flshfo_aix.f same for machines running AIX
flshfo_dum.f corresponding dummy version
fhost_.c returns the host name (C routine)
fhost_dummy.f corresponding dummy version
alloc_memory.f enables dynamic allocation of X array, calls
alloc_vm. Call be commented out in driver.f.
alloc_vm_f90.f portable alloc_vm for most f90 compilers
alloc_vm_f77_32.f f77 interface to get_vm_.c (32-bit architecture)
alloc_vm_f77_64.f f77 interface to get_vm_.c (64-bit architecture)
get_vm_.c corresponding allocator routine written in C
alloc_vm_gfortran.f substitute to alloc_vm_f90.f if intrinsic LOC
is not available
loc8_.c corresponding substitute to LOC routine
Also, the original utility routine PARITY has been renamed FARITY
to avoid a name conflict in some system libraries.
Makefile templates for various platforms
Presently supported platforms are:
The distribution provides a flexible Makefile template with extensive comments. The Makefile includes compilation directives provided by a variety of compil*.'arch' files corresponding to various compilers, BLAS/Lapack libraries and machine architectures. These files also select pmpnone.o for serial or OpenMP runs, or pmpnew.o for MPI runs. They also provide a guess for the fastest matrix inversion and diagonalisation routines for your architecture (see discussion below).
The naming convention is the following:
At run time the code prints a summary of the build options (MPI and/or OpenMP, recoupling algebra routines) together with the number of threads and MPI tasks.
The source code has been splitted to
individual routines to facilitate the development and reduce
compilation time. The script tags.sh generates the appropriate TAGS
file to navigate the source with the emacs editor (see comments in
tags.sh).
In addition to the Makefile, the script compil.sh can be used to
build the following utility programs:
As a rule of thumb, the MKL library is best for Intel processors, and the ACML library is best for AMD processors. These libraries provide both Lapack and BLAS routines. On Power-based mainframes, you should use the latest Lapack library (currently 3.1.1) and invoke IBM's ESSL library for providing the underlying BLAS. Beware that some libraries (ACML, ESSL, GotoBLAS...) are provided in distinct versions for serial and multi-threaded applications, thus make sure you link to the proper one.
In addition the present distribution provides two choices for matrix inversion and diagonalisation routines. The optimal routines should be selected in the compil* file enabled in your Makefile. The best choice depends on your hardware and on your local Lapack and BLAS libraries, and is also generally different for serial or multi-threaded runs. Beware that the performance impact may be quite important.
Cumulative update to distribution of 12 May 2008.
27 May 2008, PV/LAOG.
The tar file should be unpacked under directory distrib_20080512/
Then issue
$ rm compil*.dsy* compil*.dge*
$ rm f02abf.* f02aaf.* f02abf_ori.* f02aaf_ori.*
$ rm cpl3*.o cpl4*.o init_vl_mesh.o spropn.o wavvec.o
$ make
- The sub-directory linear_bench/ provides various files to
exercise the performance of the local Lapack and BLAS libraries
and to select the most appropriate inversion and diagonalisation
routines, as discussed above.
- The Makefile and the compil* files have been modified to permit
the selection of dsyevr- or dsyevx-based routines f02abf and
f02aaf. The compil* names have been also shortened.
- Minor changes in cpl3*.f cpl4*.f init_vl_mesh.f spropn.f wavvec.f
to comply with some compilers.
Affected files:
Makefile compil* cpl3*.f cpl4*.f init_vl_mesh.f spropn.f wavvec.f
New files:
f02abf_dsyevr.f f02aaf_dsyevr.f f02abf_dsyevx.f f02aaf_dsyevx.f
(renamed from f02abf.f, f02aaf.f, f02abf_ori.f, f02aaf_ori.f).
New sub-directory: linear_bench/
19 May 2008, PV/LAOG.
Fixed files compil_omp.aix.xlf64.dsy and compil_omp_mpi.aix.xlf64.dsy
to invoke the esslsmp threaded library instead of the serial essl.
Update to distribution of 12 May 2008.
14 May 2008, PV/LAOG and George C. McBane.
The tar file should be unpacked under directory distrib_20080512/
Then issue
$ rm wavvec.o
$ make
This update provides OpenMP support in WAVVEC for ITYPE = 2 and 7
(contributed by George C. McBane).
The scaling is reasonable, though limited by memory bandwidth in
some cases; the P() array is not accessed in a cache-friendly way.
My test case (N=504,NPOTL = 25, MXLAM = 114114, on 1.3 GHz Power4)
gave the following:
threads WAVVEC time in DAPROP
1 16.0 s
2 8.1
4 4.5
8 3.0
New distribution, unpacks under directory distrib_20080512/
12 May 2008, PV/LAOG. Mesh-based algorithm for the couplings:
- Rewritten and the mesh-based algorithm in-core and out-of-core
storage of the coupling elements (W matrix).
- Updated accordingly the utility storage.x to estimate workspace
required for MOLSCAT calculations.
Mostly affected files: driver.f, init_vl_mesh.f, init_pv2.f,
cpl3_pv2.f, cpl4_pv2.f, wirput.f, wavwir.f, wavmat.f, storage.f.
New distribution, unpacks under directory distrib_20080505/
5 May 2008, PV/LAOG. Refurbished source code:
- Fixed a missing private variable in daprop.f
- Added DEFAULT(NONE) to all OMP PARALLEL constructs
- Established locks for warning messages as implemented in couple.f
- Performed various cosmetic updates of source code and comments.
- Fixed the proper ordering of the MKL library components in the
compil*.'arch' files.
- Merged files smerge.f and sdata.fi with the update posted
by George C. McBane on his repository, April 25, 2008.
2 May 2008, PV/LAOG.
Introduced a sparsity-aware algorithm for packing the VL array.
The CPU penalty is very low. Actually the new code is faster in
some situations because the memory is accessed in a very regular
manner, and because the memory bandwidth is reduced.
The logic for packing/unpacking is also OpenMP aware.
- VL data below VLTHRESH in absolute value is discarded;
- VL data above VLTHRESH is packed in a very compact form
requiring a single integer index per real*8 value.
( See comments in sparsity_inc.f and chunk.f for more details. )
The new algorithm is presently available for ITYPE=3,4 and is
selected by defining a VLTHRESH value in namelist &INPUT.
Selecting a low VLTHRESH value (typically 1.0d-12) should not
affect the numerical results.
The fraction of VL couplings above 1.0d-12 is now printed out
for PRINT.ge.3 and typically lies in the range 0.12 to 0.3.
The size of the VL array with the conventional logic is
~ 4*MXLAM*N^2/1024^3 (in GB)
The size of the packed VL array with the new logic is
~ fraction*1.5 * size (if NIPR=2, default integer*4)
~ fraction*2.0 * size (if NIPR=1, default integer*8)
The memory saving is thus larger for NIPR=2, reducing the
conventional storage by a factor 2 to 5.
Affected routines: driver.f, couple.f, wavmat.f
New routines: cpl3_pv3.f, cpl4_pv3.f, wavvec_pv3.f, chunk.f
New include file: sparsity_inc.f
This development follows the suggestion by Millard Alexander to
exploit the sparsity of the coupling arrays, as it is already done
within the HIBRIDON package.
2 May 2008, PV/LAOG.
Provided the utility storage.x to estimate the workspace required
for MOLSCAT calculations.
- compile and link storage.x with the script compil.sh.
- storage.x reads the molscat.inp file, asks the MXLAM value for
your favorite POTENL routine, and estimates the required
storage for each JTOT, assuming a conventional storage for VL
array.
30 April 2008, George C. McBane.
a) I have completed an OpenMP-parallelized COUPLE for ITYPE=1, 2, and
7. It's probably important only for ITYPE 7; for the other two,
usually the COUPLE times are small anyway.
Some changes in logic even for serial compilation:
1) scratch storage for 3j and 6j symbols allocated in X() at
beginning rather than separate allocations for each potential
term
2) ITYPE=1 zeros out VL at start, then updates only nonzero
elements
3) Redundant assignments of zero IV(I) values removed in
ITYPE=2,7
These changes should not affect numerical results.
My test times on an ITYPE=7 problem with N=504, NPOTL = 25,
MXLAM = 114114 give the following:
threads/CPUs time in COUPLE
1 81.8 s
2 41.0
4 20.6
8 10.3
so for this MXLAM the scaling is nearly perfect as would be
expected.
b) In PMP, I had modified IVCHK so that it would skip the duplicate
check if MXLAM was >1000. That modification makes just as much
sense for serial or SMP code, so I suggest removing the test
against LL_PMP in that subroutine. A better solution, of course,
would be to write a better IVCHK that uses O(M log M) operations
rather than O(M*M) operations, where M=MXLAM. Such a routine
could be based on any good sorting algorithm, but I have not taken
time to write one.
Cumulative update to distribution of 07 April 2008.
The tar file should be unpacked under directory distrib_20080407/
Then issue
$ rm asrot.o cpl4.o dmsym.o ipasym.o set4.o set6.o
$ rm pmp*.o driver.o gelapsed*.o gclock*.o init_build.o
$ make
Implements a couple of suggestions from George C. McBane after
running the code at San Diego Supercomputer center:
- Added a call to flshfo(6) at the end of the DIE subroutine, to
ensure that any error messages actually get written to the log
file before the process terminates. This may be required by some
MPI implementations.
- Added a variant for GCLOCK and GELAPSED timing routines for MPI
calculations. The new routines invoke MPI_Wtime() and provide
wallclock timing information which may be more relevant then cpu
timing information on MPI supercomputers. In addition, these
routines are fully portable under an MPI environment. This
alternative required some adjustments in pmpdyn.f, pmpnew.f and
driver.f to postpone the close out of MPI after all timing
information has been gathered.
To use these routines, put gclock_mpi.o in place of gclock_f95.o
and gelapsed_mpi.o instead of gelapsed.o and timing_.o in the
UTIL macro in the compil_omp_mpi and compil_mpi makefile
sections.
- Reformulated a warning message in init_build.f concerning the
CPU timing statistics for threads.
Cumulative update to distribution of 07 April 2008.
The tar file should be unpacked under directory distrib_20080407/
Then issue
$ rm asrot.o cpl4.o dmsym.o ipasym.o set4.o set6.o
$ make
- Provides a new diagonalisation scheme for asymmetric top levels
(ITYPE=4,6), which fixes an erroneous MOLSCAT output of the
sequence of TAU values for degenerate states. Result of
scattering calculations is not affected because the TAU index is
not used in the build up of rotational couplings.
This minor bug obscured the correspondance with spectroscopic
notation and was pointed out by Annie Spielfiedel (Paris/Meudon
Observatory).
The correspondance between MOLSCAT output and standard spectroscopic
notation is the following:
km1-k1 = tau
km1 = ka = int( (j1+tau+1)/2 )
k1 = kc = int( (j1-tau+1)/2 )
The update introduces a variant of the ASROT routine which calls
the NEWROTDIAG routine (both routines in asrot.f). See more
details in asrot.f, in particular comments in routine
NEWROTDIAG. (The original MOLSCAT behaviour can be restored by
setting the flag NEWALGO to .false. in ASROT.)
- Provides minor updates of cpl4.f, dmsym.f, ipasym.f with
comments or debug features.
- Provides a bug fix limiting the output to IASYMU unit to the
dispatcher task alone in the case of parallel PMP runs.
Affected files: set4.f, set6.f.
Bugfix distribution, unpacks under directory distrib_20080407/
- Fixed a serious bug in pmpnew.f concerning the handling of
MPI_Isend buffers. Invalid (JTOT, M) values could be passed for
calculations on the master node, aborting the MPI run.
- Added a check in driver.f to prevent setting ISIGU in PMP runs.
- Updated compil*.ia64.* files for Itanium.
- Updated files in tst/ subdirectory.
Initial distribution, unpacks under directory distrib/.