Web server DDfit: a new scheme to process PFG NMR diffusion data with improved precision


Vladislav A. Salikov¹, Olga O. Lebedenko¹, Nikolai R. Skrynnikov^1,2*, Ivan S. Podkorytov¹*

¹ Laboratory of Biomolecular NMR, St. Petersburg State University, St. Petersburg 199034, Russia

² Department of Chemistry, Purdue University, West Lafayette, IN 47907, USA

* Corresponding authors: n.skrynnikov@spbu.ru, i.podkorytov@spbu.ru

Keywords: protein diffusion, PFG NMR, stimulated echo, spectral fitting, signal-to-noise optimization, baseline correction

Abstract

In this communication we describe a new scheme to process the data from stimulated echo protein diffusion experiments. For a series of gradient-encoded proton spectra

$\:{f}_{k}\left(\omega\:\right)$

considered over the selected spectral region

$\:({\omega\:}_{left},{\omega\:}_{right})$

, we build a model to approximate the unique (protein-dependent) shape of the spectrum. Taking a cue from the optimal filtration theory,

$\:{f}_{model}\left(\omega\:\right)$

is constructed as the intensity-weighted combination of

$\:{f}_{k}\left(\omega\:\right)$

. The so obtained

$\:{f}_{model}\left(\omega\:\right)$

is then used to fit the individual spectra

$\:{f}_{k}\left(\omega\:\right)$

, thus providing highly accurate estimates for the integral signal intensities that are subsequently used for Stejskal-Tanner-type analyses. This algorithm has been implemented as a part of a new web server, named DDfit (https://ddfit.bio-nmr.spbu.ru/). The server accepts spectrometer data from the standard stimulated and double-stimulated echo experiments by Bruker, as well as custom-designed experiments. The server is easy to use, with data processing taking no more than several seconds. Our tests using simulated as well as experimental data found that DDfit determines protein diffusion coefficients with both accuracy and precision, offering several-fold improvement in precision compared to other processing schemes.

1. Introduction

Pulsed-field gradient nuclear magnetic resonance (PFG-NMR) is extensively used for characterization of macromolecules and macromolecular assemblies. The range of problems addressed by PFG NMR includes protein-ligand interactions (Tillett et al. 1999; Weljie et al. 2003; Lucas et al. 2004; Segev et al. 2005; Dalvit and Vulpetti 2012), protein oligomerization and aggregation (Ilyina et al. 1997; Price et al. 1999; Barhoum and Yethiraj 2010; Kheddo et al. 2016; Rabdano et al. 2017) including applications to amyloidogenesis (Tseng et al. 1999; Li et al. 2005; Narayanan and Reif 2005; Svane et al. 2008; Soong et al. 2009; Waelti et al. 2015; Hoffmann et al. 2018), protein folding (Wilkins et al. 1999; Balbach 2000; Buevich and Baum 2002; Li et al. 2007; Ramanujam et al. 2020), characterization of molecular machines and other supramolecular assemblies (Christodoulou et al. 2004; Baldwin et al. 2011; Huang et al. 2016; Leung et al. 2017; Kitevski-LeBlanc et al. 2018; Kharkov et al. 2021), probing special protein environments (Coffman et al. 1997; Wang et al. 2010; Waudby et al. 2012; Roos et al. 2016; Brady et al. 2017; Wong et al. 2020) and a host of other applications.

PFG NMR measurements often suffer from a low signal-to-noise (S/N) ratio. This situation typically arises from a low concentration of the species of interest in the NMR sample. In turn, this can be a consequence of limited solubility or reflect the fact that the sample material is difficult and/or expensive to prepare. Sometimes the situation is exacerbated by a short lifetime of the sample, which limits the experimental acquisition time. Along the same lines, when monitoring the evolution of amyloidogenic samples, the data must be acquired over short time intervals. Furthermore, such experiments often target specific species within the evolving mixture (e.g. monomers or low-order oligomers), which means that the actual concentration can be lower than the nominal concentration. Because of spectral overlaps, the observations may be limited to a small number of spectral lines (peaks), which further undermines the sensitivity. Alternatively, one can deliberately choose to work at low protein concentration in order to avoid the effects of obstruction (entanglement), weak dimerization, etc. (Wilkins et al. 1999; Liu et al. 2012; Ramanujam et al. 2020). This is particularly important for detecting small changes in the diffusion coefficient, e.g. due to expansion (contraction) of disordered proteins in response to experimental parameters such as temperature, pressure or pH.

The other factor responsible for low S/N ratio is associated with relaxation losses and relaxation broadening. This is particularly relevant for heavy (supra)molecular species. Signals from such species tend to be broadened beyond detection, such that one often has to rely on special labeling schemes in order to collect NMR data (Kitevski-LeBlanc et al. 2018). In turn, this means that only a limited number of spins (e.g. a subset of methyl sites) contribute to the observed signal, which has the effect of lowering the effective concentration of the sample. Alternatively, relaxation losses can be due to a viscous environment, e.g. as a result of crowding, phase separation, etc. Since spin magnetization recovery is slowed down in the slow-motion regime, the experiments have to employ a long recycling delay, which further degrades the S/N ratio. In some other cases, the signals come from (moderately) flexible portions of a heavy system, e.g. histone tails in the nucleosome core particle (Kato et al. 2009). In this situation, the widely used stimulated echo experiments can suffer from a certain amount of relaxation loss during the requisite long diffusion delay

$\:{\Delta\:}$.

A substantial effort has been made to address the S/N problem in PFG NMR experiments by means of pulse sequence engineering. In doing so, the investigators used favorable relaxation properties of heteronuclear spin modes (Choy et al. 2002; Ferrage et al. 2003), selective excitation schemes and other designs to speed up magnetization recovery (Augustyniak et al. 2011; Augustyniak et al. 2012; Chan et al. 2015), more efficient phase coding via multiple-quantum coherences (Huang et al. 2017) and TROSY-type schemes to minimize relaxation losses (Horst et al. 2011; Didenko et al. 2011). However, to this day the majority of PFG NMR measurements are conducted on unlabeled samples using the simple and robust stimulated echo (STE) and double-stimulated echo (DSTE) sequences (Stejskal and Tanner 1965; Jerschow and Muller 1997). For these applications, poor S/N ratio often remains a pressing problem.

Here we address this problem by presenting a simple data processing scheme, which allows one to improve the precision of the standard NMR diffusion measurements. In what follows, we will briefly explain the idea behind the proposed scheme (a more detailed description is presented in Materials & Methods section; see also (Kharkov et al. 2021)). For the sake of argument, let us consider an NMR spectrum consisting of a single Lorentzian line that has been recorded with a very poor S/N ratio. In dealing with such data, the optimal processing strategy is to fit the spectrum with a Lorentzian function, integrate the fitted contour and thus obtain the (best attainable) estimate for signal intensity. The benefit to this approach is that we know the shape of the spectral line and utilize this prior knowledge in order to access the information of interest (i.e. to find the signal intensity). Clearly, this is a well-known strategy, which is broadly used in the experimental NMR practice. In particular, this strategy has been implemented in the programs NMRPipe (Delaglio et al. 1995) and FuDA (Hansen et al. 2007), where it is adapted for 2D and 3D spectra. Of note, the algorithm can deal not only with isolated peaks, but also with clusters of two or more peaks (which are treated as a superposition of several multidimensional Lorentzian contours or, depending on data acquisition scheme and window function applied, other types of contours). However, what happens if the spectral shape is complex and cannot easily be represented as a sum of several Lorentzians?

The case in question is a set of 1D spectra

$\:{f}_{k}\left(\omega\:\right)$

from STE or DSTE measurements on a biomacromolecular sample (here subscript

$\:k$

enumerates the spectra collected with different gradient strengths

$\:{G}_{k}$

). In interpreting these data, one usually selects a certain region within the spectrum and determines the integral intensity of the signal over this region,

$\:{I}_{k}$

. Typically, the selected region contains multiple (from tens to hundreds) heavily overlapped spectral lines, giving rise to a rather unique spectral envelope. This envelope cannot be readily or stably fitted using a generic template function, such as a combination of several Lorentzians. In lieu of such a generic template, we seek to build a case-specific function

$\:{f}_{model}\left(\omega\:\right)$

that simply reproduces the relevant portion of the spectral envelope.

A simple way to obtain reasonable

$\:{f}_{model}\left(\omega\:\right)$

is to add all spectra in the STE (DSTE) series,

$\:{f}_{model}\left(\omega\:\right)={\sum\:}_{k}{f}_{k}\left(\omega\:\right)$

. Obviously, this should reduce the amount of noise compared to each individual spectrum

$\:{f}_{k}\left(\omega\:\right)$

. However, on second thought it becomes clear that this strategy is sub-optimal. Indeed, the spectra

$\:{f}_{k}\left(\omega\:\right)$

corresponding to the strongest gradients

$\:{G}_{k}$

usually contain precious little signal. Including these spectra into the sum does not improve

$\:{f}_{model}\left(\omega\:\right)$

, but rather contaminates it with noise. A better strategy is to form the weighted sum of the spectra,

$\:{f}_{model}\left(\omega\:\right)={\sum\:}_{k}{I}_{k}^{\left(0\right)}{f}_{k}\left(\omega\:\right)/{\sum\:}_{k}{I}_{k}^{\left(0\right)}$

, where

$\:{I}_{k}^{\left(0\right)}$

is the zeroth-order approximation for the integral of interest (obtained e.g. by straightforward integration of the

$\:k$

-th spectrum over the selected region). This ansatz is rooted directly in the optimum filtration theory (Ernst et al. 1987), yielding the best possible S/N ratio for

$\:{f}_{model}\left(\omega\:\right)$

. Using the resulting “low-noise” template

$\:{f}_{model}\left(\omega\:\right)$

, we can further fit the individual spectra

$\:{f}_{k}\left(\omega\:\right)$

– and thus obtain more accurate intensities,

$\:{I}_{k}^{\left(1\right)}$

(similar to the familiar fitting strategy using Lorentzian templates).
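The advantage of intensity weighting over plain summation can be checked with a few lines of Python. The sketch below is illustrative only: it assumes that all spectra share the same shape, carry independent noise of equal rms amplitude, and that the weights are the noise-free amplitudes (in practice the noisy integrals $I_k^{(0)}$ serve as estimates). The function name and the decay profile are hypothetical.

```python
import numpy as np

def snr_of_weighted_sum(amplitudes, weights, noise_sigma=1.0):
    """S/N of sum_k w_k f_k, where spectrum k has signal amplitude s_k
    and independent noise with rms amplitude noise_sigma."""
    s = np.asarray(amplitudes, dtype=float)
    w = np.asarray(weights, dtype=float)
    return np.sum(w * s) / (noise_sigma * np.sqrt(np.sum(w ** 2)))

# Hypothetical decay profile: 15 spectra, amplitudes falling from 1.0 to about 0.06
amplitudes = np.exp(-np.linspace(0.0, 2.8, 15))

snr_single = amplitudes[0]                                        # best single spectrum
snr_plain = snr_of_weighted_sum(amplitudes, np.ones_like(amplitudes))
snr_matched = snr_of_weighted_sum(amplitudes, amplitudes)         # weights ~ intensity

print(f"single: {snr_single:.2f}  plain sum: {snr_plain:.2f}  weighted sum: {snr_matched:.2f}")
```

Under these assumptions the weighted sum can never be worse than the plain sum (this follows from the Cauchy-Schwarz inequality), and the gap widens as more strongly attenuated spectra are included in the series.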

This new data processing scheme has been implemented as a part of the web server DDfit (Diffusion Data fit, https://ddfit.bio-nmr.spbu.ru/). The server allows one to load the data files in a standard Bruker format, indicate the boundaries of the integration region and in a matter of several seconds obtain the table of

$\:{I}_{k}^{\left(1\right)}$

. If desired, the server can also fit the results using the famous Stejskal-Tanner equation for the STE experiment (Stejskal and Tanner 1965) or its suitably modified version for the DSTE experiment (Jerschow and Muller 1997), produce the graph with the fitted data and report the determined diffusion coefficient. Simple interfaces are available to handle the results of the experiments recorded with the standard Bruker pulse sequences; otherwise, more flexible interfaces are offered to deal with user-designed sequences.

The performance of DDfit has been tested on both simulated and experimental diffusion data. The new scheme is shown to be both accurate and precise, achieving considerably better precision than the standard processing schemes such as those offered in the Topspin package (Zick 2016) or the state-of-the-art MestreNova program (Mestrelab 2024).

2. Materials and Methods

Algorithm

Prior to application of the algorithm, the FIDs from diffusion NMR experiment are window multiplied, Fourier transformed and phased. The resulting spectra

$\:{f}_{k}\left(\omega\:\right)$

are inspected to choose a suitable integration region, with bounds

$\:{\omega\:}_{left}$

and

$\:{\omega\:}_{right}$

. The integration region is chosen using the usual considerations, i.e. it should contain a substantial amount of protein signal, but leave out the areas dominated by noise, residual water signal, spectral lines arising from buffer compounds or low-molecular-weight contaminants, etc. The further procedure consists of the following steps:

(i) Linear baseline correction is applied to all spectra

$\:{f}_{k}\left(\omega\:\right)$

using left and right endpoints of the region of interest,

$\:{\omega\:}_{left}$

and

$\:{\omega\:}_{right}$.

(ii) The so-obtained spectra

$\:{\stackrel{\sim}{f}}_{k}\left(\omega\:\right)$

are integrated over the interval from

$\:{\omega\:}_{left}$

to

$\:{\omega\:}_{right}$

. The resulting integrals

$\:{I}_{k}^{\left(0\right)}$

suffer from noise, which arises, in particular, from the use of 2-point baseline correction in the previous step.

(iii) By calculating the weighted average of all spectra in a series, we obtain a model spectrum. The weights are proportional to

$\:{I}_{k}^{\left(0\right)}$

, as evaluated in the previous step.

$\:{\stackrel{\sim}{f}}_{model}\left(\omega\:\right)=\frac{{\sum\:}_{k=1}^{N}{I}_{k}^{\left(0\right)}{\stackrel{\sim}{f}}_{k}\left(\omega\:\right)}{{\sum\:}_{k=1}^{N}{I}_{k}^{\left(0\right)}}$

The weighting scheme is based on the idea of matched filtration, which maximizes the signal-to-noise ratio of the resultant sum spectrum (Peebles 2000; Ernst et al. 1987).

(iv) All spectra

$\:{\stackrel{\sim}{f}}_{k}\left(\omega\:\right)$

are approximated using the ansatz

$\:{a}_{k}+{b}_{k}\omega\:+{c}_{k}{\stackrel{\sim}{f}}_{model}\left(\omega\:\right)$

. The term

$\:{a}_{k}+{b}_{k}\omega\:$

is intended to refine the baseline correction performed in step (i). The term

$\:{c}_{k}{\stackrel{\sim}{f}}_{model}\left(\omega\:\right)$

represents the spectrum per se, assuming that each spectrum

$\:{\stackrel{\sim}{f}}_{k}\left(\omega\:\right)$

can be expressed as a scaled version of the model spectrum. The coefficients

$\:{a}_{k}$

,

$\:{b}_{k}$

, and

$\:{c}_{k}$

are fitted by minimizing the following target function:

$\:{\chi\:}^{2}=\underset{{\omega\:}_{left}}{\overset{{\omega\:}_{right}}{\int\:}}{\left({\stackrel{\sim}{f}}_{k}\left(\omega\:\right)-\left({a}_{k}+{b}_{k}\omega\:+{c}_{k}{\stackrel{\sim}{f}}_{model}\left(\omega\:\right)\right)\right)}^{2}d\omega\:$

In doing so, standard linear least-squares procedures, such as numpy.linalg.lstsq, can be used. The coefficients

$\:{c}_{k}$

are taken to be the experimental values of

$\:{I}_{k}^{\left(1\right)}$

, i.e. the refined signal intensities that are subsequently used to determine the diffusion coefficient.
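Steps (i)-(iv) can be sketched in a few lines of NumPy. This is a minimal illustration rather than the DDfit source code: the function name `model_based_intensities` and the synthetic test spectrum are made up for the example, the spectra are assumed to be real-valued and already cropped to the integration region, and the $\chi^2$ integral is replaced by a sum over the points of a uniform frequency grid.

```python
import numpy as np

def model_based_intensities(spectra, omega):
    """Return (I0, c, model) for spectra cropped to the integration region.

    spectra : (N, M) array, one row per gradient strength G_k
    omega   : (M,) frequency axis running from omega_left to omega_right
    """
    spectra = np.asarray(spectra, dtype=float)
    omega = np.asarray(omega, dtype=float)

    # (i) straight-line baseline through the two endpoints of each spectrum
    ramp = (omega - omega[0]) / (omega[-1] - omega[0])
    baseline = spectra[:, :1] + (spectra[:, -1:] - spectra[:, :1]) * ramp
    corrected = spectra - baseline

    # (ii) zeroth-order integrals by the trapezoid rule
    step = np.abs(np.diff(omega))
    I0 = 0.5 * np.sum((corrected[:, 1:] + corrected[:, :-1]) * step, axis=1)

    # (iii) intensity-weighted model spectrum
    model = (I0[:, None] * corrected).sum(axis=0) / I0.sum()

    # (iv) linear least squares: corrected_k ~ a_k + b_k*omega + c_k*model
    design = np.column_stack([np.ones_like(omega), omega, model])
    coeffs, *_ = np.linalg.lstsq(design, corrected.T, rcond=None)
    return I0, coeffs[2], model

# Synthetic check: one envelope scaled by known factors plus a tilted baseline
omega = np.linspace(1.33, 0.30, 200)             # ppm, left to right
envelope = (np.exp(-((omega - 0.9) / 0.08) ** 2)
            + 0.6 * np.exp(-((omega - 0.6) / 0.05) ** 2))
scales = np.exp(-np.linspace(0.0, 2.8, 15))      # hypothetical decay profile
tilt = 0.02 + 0.01 * omega                       # linear baseline offset
spectra = scales[:, None] * envelope + tilt

I0, c, model = model_based_intensities(spectra, omega)
print(np.round(c / c[0], 4))                     # relative intensities
```

In this noise-free test the ratios `c / c[0]` reproduce the input scaling factors, since every baseline-corrected spectrum is an exact multiple of the model. With noisy data the two estimates `I0` and `c` differ, and it is `c` that benefits from the low-noise template.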

Server architecture

The schematic layout of the DDfit server (https://ddfit.bio-nmr.spbu.ru/) is shown in Fig. 1. Currently, it is assumed that diffusion data are acquired on a Bruker spectrometer and stored in Bruker format (in principle, routines exist to convert Varian or JEOL data to Bruker format (Bruker 2020; Helmus and Jaroniec 2013) before processing them with DDfit, but this would likely require some manual editing of the spectrometer files). As a first step, a series of FIDs from the diffusion experiment is pre-processed using Bruker’s TopSpin software, viz. multiplied by a window function, Fourier-transformed and phased. The resulting spectral data are transferred to DDfit in the form of a standard 2rr file. Also transmitted to DDfit are the procs file (contains several parameters of interest, such as spectrometer operating frequency and spectral width), the proc2s file (supports reading of the 2rr matrix by the nmrglue library (Helmus and Jaroniec 2013)), and the difflist file (contains a list of gradient strengths

$\:{G}_{k}$

expressed in the units of G/cm). The gradient strengths in difflist are for rectangular gradient shapes (in the case of shaped gradients, the shape factor is factored into

$\:{G}_{k}$

).

Figure 1. Block diagram illustrating the processing of NMR diffusion data by TopSpin and DDfit. * For DSTE experiments, the results are interpreted using the Stejskal-Tanner equation with Jerschow-Muller modifications.

The main page of the DDfit website offers the user a choice between STE and DSTE experiments; in addition, an option to determine signal intensities irrespective of the experimental details is also available. For both STE and DSTE, the user can select one of the standard Bruker experiments: stegp1s for STE using simple monopolar gradients, stebpgp1s for STE using bipolar gradients, dstegp3s for DSTE using monopolar gradients, or dstebpgp3s for DSTE using bipolar gradients. Otherwise, non-standard (in-house) variants of the same pulse sequences can also be processed.

For any specific choice of the pulse sequence, the user is presented with a small web form where he/she needs to enter several parameters pertaining to this particular experimental setup. For example, in the case of dstegp3s experiment one has to indicate the observation spin (selected from a dropdown menu, typically ¹H), the length of the gradient pulse p30, the duration of the diffusion delay d20, the length of the hard 90° pulse p1 and the duration of the gradient recovery delay d16. Likewise, for the other three Bruker sequences the user must enter several relevant parameters specific to the chosen experiment.

In the case of non-standard STE or DSTE sequences, the user is presented with a generalized scheme of the sequence which contains generic names for pulses and delays. The idea is that the experimentalist should make a connection between the generic variables and the actual parameters of the experiment. The so-defined variables are then entered into the web form associated with this particular type of pulse sequence.

Finally, the user also needs to indicate the boundaries of the spectral region to be used for signal intensity determination,

$\:{\omega\:}_{left}$

and

$\:{\omega\:}_{right}$

. The user can choose to enter these values in the units of ppm or otherwise in points. The criteria to choose a suitable spectral region are described in the previous section.

Once all input parameters are entered, the job is submitted for execution on a dedicated DDfit server. As a first step, the series of spectra is fitted using the optimal model

$\:{\stackrel{\sim}{f}}_{model}\left(\omega\:\right)$

as described in the previous section. If the user requests only signal intensities (“integrals only” panel), the program generates a table of intensities,

$\:{I}_{k}^{\left(1\right)}\left({G}_{k}\right)$

, and stops there. The table can be displayed for visual inspection or downloaded as an ASCII file. As a part of the output, the server also generates an image of the spectrum (the one acquired under the lowest gradient strength), where the integration region is highlighted in red. The server also produces another image, showing a stack of experimental spectra (following baseline correction, restricted to the integration region). In addition, it offers a basic interactive plotting facility, allowing one to zoom into selected regions of the spectrum, adjust the plotting scale, etc. All images can be saved in various graphics formats. The user can generate a unique link to the results page and store it for future reference.

Otherwise, if the user requests analysis of the diffusion data (“stimulated echo” or “double-stimulated echo” panels), the program additionally fits the obtained

$\:{I}_{k}^{\left(1\right)}\left({G}_{k}\right)$

data to the Stejskal-Tanner equation or the Stejskal-Tanner equation with Jerschow-Muller modifications. In this case, DDfit also displays a graph illustrating the data fitting and reports the determined diffusion coefficient

$\:D$

together with the standard error derived from the covariance matrix. This part of DDfit service is trivial and uses a standard minimization routine. All calculations, including spectral intensity determination and Stejskal-Tanner fitting, take no more than several seconds.
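As an illustration of this last step, the sketch below fits a decay curve to the basic Stejskal-Tanner expression with `scipy.optimize.curve_fit` and takes the standard error of $D$ from the diagonal of the returned covariance matrix. It is an example under stated assumptions, not the server code: the helper `fit_diffusion`, the parameter values and the synthetic data are hypothetical, rectangular gradient pulses are assumed, and the sequence-specific corrections (bipolar gradients, Jerschow-Muller terms for DSTE) are omitted.

```python
import numpy as np
from scipy.optimize import curve_fit

GAMMA_H = 26752.2  # proton gyromagnetic ratio in rad s^-1 G^-1

def fit_diffusion(G, intensities, delta, Delta):
    """Fit I(G) = I_0 * exp(-(gamma*G*delta)^2 * (Delta - delta/3) * D).

    G in G/cm, delta and Delta in s. Returns D and its standard error,
    both in units of 1e-7 cm^2/s.
    """
    G = np.asarray(G, dtype=float)
    intensities = np.asarray(intensities, dtype=float)
    b = (GAMMA_H * G * delta) ** 2 * (Delta - delta / 3.0)   # s / cm^2

    def decay(b, I_0, D7):
        # D7 is the diffusion coefficient in units of 1e-7 cm^2/s
        return I_0 * np.exp(-b * D7 * 1e-7)

    popt, pcov = curve_fit(decay, b, intensities, p0=[intensities[0], 10.0])
    return popt[1], np.sqrt(pcov[1, 1])

# Synthetic example with hypothetical acquisition parameters and noise level
rng = np.random.default_rng(0)
G = np.linspace(0.95, 35.1, 15)
delta, Delta, D_true = 5.4e-3, 0.1, 11.2e-7
b = (GAMMA_H * G * delta) ** 2 * (Delta - delta / 3.0)
data = 100.0 * np.exp(-b * D_true) + rng.normal(0.0, 0.5, G.size)

D_fit, D_err = fit_diffusion(G, data, delta, Delta)
print(f"D = ({D_fit:.2f} ± {D_err:.2f}) x 1e-7 cm^2/s")
```

Expressing $D$ in units of 10⁻⁷ cm² s⁻¹ keeps the two fitted parameters on comparable numerical scales, which tends to help the optimizer.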

DDfit also offers a demo page, demonstrating the analyses of DSTE data acquired in-house for a sample of hen egg-white lysozyme (see next section). All spectrometer files in the demo package are downloadable. Finally, there is also a help page, providing the description of the algorithm, usage instructions, and a review of stimulated echo sequences.

The DDfit server has been built using Python Flask framework, with Nginx as a frontend and uWSGI as a gateway interface. The application makes use of several libraries: nmrglue (reading of spectrometer data), scipy (minimization), matplotlib (graphics), Bokeh (interactive graphics) and uuid module in Python (generation of unique webpage identifiers). The server is hosted on a desktop PC equipped with Intel i5-4440 CPU and 16 GB of RAM running Ubuntu 24.04.3.

Synthetic diffusion data

DDfit algorithm was first tested on synthetic diffusion data. Synthetic FIDs have been generated for ubiquitin using the simulation platform Spinach 2.6.5625 (Edwards et al. 2014). The list of proton chemical shifts was from the Spinach demo file 1d3z.bmrb. The spectrometer operating frequency was assumed to be 500 MHz and the spectral width was taken to be 15 ppm. For simplicity, we turned off spin relaxation in Spinach simulations; instead, a uniform line broadening was introduced by applying the exponential multiply window function to the simulated FID (LB of 15 Hz). To simulate the effect of translational diffusion, 15 copies of ubiquitin FID have been multiplied by Stejskal-Tanner factors:

$\:{{\uplambda\:}}_{k}=\text{exp}\left(-{\left(\gamma\:{G}_{k}\delta\:\right)}^{2}\left({\Delta\:}-\frac{1}{3}\delta\:\right)D\right)$

where

$\:\gamma\:$

is a proton gyromagnetic ratio,

$\:{\Delta\:}$

is the diffusion delay (100 ms),

$\:\delta\:$

is the duration of rectangular gradient pulse (5.4 ms),

$\:{G}_{k}$

is a series of fifteen gradient strengths uniformly distributed between 0.95 and 35.1 G/cm and

$\:D$

is the translational diffusion coefficient of ubiquitin taken to be 11.2×10⁻⁷ cm² s⁻¹ (Ramanujam et al. 2020).
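For orientation, the attenuation factors implied by these parameters can be evaluated directly. The snippet below is a plain transcription of the formula above; the only value not taken from the text is the proton gyromagnetic ratio, entered here in units of rad s⁻¹ G⁻¹.

```python
import numpy as np

GAMMA_H = 26752.2                  # proton gyromagnetic ratio, rad s^-1 G^-1
delta = 5.4e-3                     # gradient pulse duration, s
Delta = 0.100                      # diffusion delay, s
D = 11.2e-7                        # diffusion coefficient, cm^2 s^-1
G = np.linspace(0.95, 35.1, 15)    # gradient strengths, G/cm

attenuation = np.exp(-(GAMMA_H * G * delta) ** 2 * (Delta - delta / 3.0) * D)
for g, lam in zip(G, attenuation):
    print(f"G = {g:5.2f} G/cm   lambda = {lam:.4f}")
```

With these settings the protein signal is practically unattenuated at the weakest gradient and decays to roughly 6% of its initial amplitude at the strongest one.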

Separately, we have generated an FID of solvent water resonating at 4.7 ppm. The broadening LB of the water signal was set to 2 Hz and the diffusion coefficient was taken to be 181×10⁻⁷ cm² s⁻¹, corresponding to HDO diffusion in D₂O (Stefaniuk et al. 2022), see below. The protein and water FIDs were then combined assuming that the sample was prepared with 100 µM ubiquitin in D₂O solvent with 2% residual H₂O content. In doing so, we assumed that highly labile protons in ubiquitin were fully replaced with deuterons while amide protons were retained (the latter assumption is inconsequential since only the aliphatic portion of the simulated spectrum is further used for signal evaluation). The synthetic diffusion data were thus prepared in the form of a 2D complex matrix with dimensions 15×4096.

As a next step, we have generated

$\:N$

= 16,383 complex noise matrices of the same size, 15×4096. The matrices contained uncorrelated Gaussian noise with zero mean; this is a sufficiently accurate representation of spectrometer noise associated with quadrature detection (Grage and Akke 2003). The magnitude of the noise was set to a small fraction of the protein FID amplitude, leading to the average S/N of 65 in the noised 1D spectra in the absence of gradient or otherwise 63 under the weakest

$\:{G}_{k}$

gradient (where S/N is defined as the ratio of the maximum amplitude of the protein signal in the spectrum

$\:\stackrel{\sim}{f}\left(\omega\:\right)$

and the root-mean-square amplitude of the noise). The so-obtained

$\:N$

realizations of the noised gradient-encoded diffusion data were then processed using the DDfit algorithm as further detailed in the Results section.
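The noise model and the S/N definition used here are easy to reproduce. The following sketch is a stand-in rather than the actual simulation pipeline (which relied on Spinach): the helper names, the noise amplitude and the single-line toy FID are assumptions made purely for illustration.

```python
import numpy as np

def complex_gaussian_noise(shape, sigma, rng):
    """Uncorrelated zero-mean Gaussian noise in both quadrature channels."""
    return sigma * (rng.standard_normal(shape) + 1j * rng.standard_normal(shape))

def signal_to_noise(clean_spectrum, noise_spectrum):
    """Maximum amplitude of the clean signal over the rms amplitude of the noise."""
    peak = np.max(np.abs(clean_spectrum.real))
    rms = np.sqrt(np.mean(noise_spectrum.real ** 2))
    return peak / rms

rng = np.random.default_rng(1)
noise = complex_gaussian_noise((15, 4096), sigma=0.01, rng=rng)

# Toy FID: one line at 300 Hz with 15 Hz exponential broadening, 7500 Hz sweep width
t = np.arange(4096) / 7500.0
fid = np.exp(2j * np.pi * 300.0 * t - np.pi * 15.0 * t)

clean_spectrum = np.fft.fftshift(np.fft.fft(fid))
noise_spectrum = np.fft.fftshift(np.fft.fft(noise[0]))
snr = signal_to_noise(clean_spectrum, noise_spectrum)
print(f"S/N of the toy spectrum: {snr:.0f}")
```

Repeating the last lines with a fresh noise matrix for every replica gives the kind of Monte-Carlo ensemble described above.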

Experimental diffusion data

DDfit algorithm has also been tested on experimental diffusion data acquired in-house from a sample of 0.5 mM hen egg-white lysozyme (HEWL, Sigma-Aldrich). The protein was dissolved in D₂O-based solvent (0.01% NaN₃, 5% dioxane, pH 2.0) and the data were collected at 15°C; under these conditions the sample remains folded and stable (Mariño et al. 2015). The measurements were conducted on Bruker Avance III 500 MHz spectrometer equipped with BBI probe using double-stimulated echo pulse sequence (dstebpgp3s). The diffusion delay

$\:{\Delta\:}$

was 200 ms and the length of bipolar gradient pulse

$\:\delta\:$

was 5 ms (two pulses with an opposite polarity of 2.5 ms each, smoothed square shape SMSQ10.100, integral shape factor 0.9). The data were recorded with twenty gradient strengths

$\:{G}_{k}$

, incremented linearly from 0.95 to 46.53 G/cm; only twelve points, from 0.95 to 28.01 G/cm have been used in data analyses. When recording the data, the gradients were arranged in pseudo-random order. The number of scans was 16 (a minimum number to accommodate the phase cycle). In total, 77 repeat diffusion experiments have been recorded back-to-back in 64 hours.

3. Results and discussion

Tests using simulated data

We begin by exploring different data processing schemes using the simulated STE dataset as described in Materials and Methods. In brief, this is the series of gradient-encoded 1D proton spectra, representing a sample of ubiquitin in D₂O with residual content of H₂O. The methyl portion of the spectra, from

$\:{\omega\:}_{left}$

= 1.33 ppm to

$\:{\omega\:}_{right}$

= 0.30 ppm, is used for signal evaluation. This region offers the highest amount of protein signal while also being maximally distant from the water resonance. The analyses are conducted for

$\:N$

= 16,383 replicas of the simulated STE dataset containing a moderate amount of random Gaussian noise. The data were pre-processed using EM window function with line broadening of 15 Hz (matched to the simulated homogeneous line broadening). The average S/N ratio of the simulated spectra within the chosen region (

$\:{\omega\:}_{left},{\omega\:}_{right}$

) under the weakest

$\:{G}_{k}$

gradient is 63.

As a first step, we test the simplified processing scheme where we do not perform any baseline correction, but rather directly proceed to spectra integration. The treatment, which involves numeric integration of the spectra within (

$\:{\omega\:}_{left},{\omega\:}_{right}$

) interval followed by Stejskal-Tanner fitting of the so-obtained integrals, has been repeated

$\:N$

times in a Monte-Carlo style simulation. The fitted values of diffusion coefficient for ubiquitin,

$\:{D}_{fit}$

, are summarized in Fig. 2a. We further selected the dataset corresponding to the median of the obtained

$\:{D}_{fit}$

distribution; for this particular dataset we present the stack of the simulated STE spectra plotted in the interval from

$\:{\omega\:}_{left}$

to

$\:{\omega\:}_{right}$

, Fig. 2b, and the results of Stejskal-Tanner fitting, Fig. 2c.

Fig. 2

Analyses of the simulated STE data using different processing schemes. Top row: no baseline correction, direct numeric integration of the spectra in the specified region, from 1.33 to 0.30 ppm. (A) Histogram of

$\:{D}_{fit}$

values obtained from Stejskal-Tanner analyses of 16,383 simulated datasets. Black vertical line indicates the target value of

$\:D$

. (B) Representative dataset corresponding to the median

$\:{D}_{fit}$

value: a series of simulated STE spectra plotted in the interval from 1.33 to 0.30 ppm. (C) Stejskal-Tanner fit of the integral signals obtained from the dataset shown in panel (B). Middle row: 2-point baseline correction, direct numeric integration of the spectra. (D) Histogram of

$\:{D}_{fit}$

values. (E) Representative dataset corresponding to the median

$\:{D}_{fit}$

value. (F) Stejskal-Tanner fit of the integral signals

$\:{I}_{k}^{\left(0\right)}$

from the dataset (E). Bottom row: 2-point baseline correction, model-based integration as implemented in DDfit. (G) Histogram of

$\:{D}_{fit}$

values. (H) Representative dataset corresponding to the median

$\:{D}_{fit}$

value, after DDfit treatment. Shown is the model spectrum

$\:{\stackrel{\sim}{f}}_{model}\left(\omega\:\right)$

(thick red line) and the series of fitted spectra

$\:{c}_{k}{\stackrel{\sim}{f}}_{model}\left(\omega\:\right)$

(thin colored lines). (I) Stejskal-Tanner fit of the integral signals

$\:{I}_{k}^{\left(1\right)}$

obtained from the dataset (H).

The inspection of Fig. 2a immediately indicates that

$\:{D}_{fit}$

values obtained from this procedure are systematically overestimated compared to the target

$\:D$

value encoded in the simulated data (the histogram is shifted to the right from the vertical line, which indicates the target). The reason for this bias is clear: the integrated signal contains a contribution from the wing of the residual water signal, which extends far upfield. As a consequence, the diffusion decay profile in Fig. 2c consists of two components: the dominant contribution from the slowly diffusing protein and a minor contribution from the rapidly diffusing water (affecting mainly the initial points of the diffusion decay profile). Predictably, fitting this combination with the Stejskal-Tanner equation equipped with the single parameter

$\:D$

results in overestimation of the diffusion coefficient.

Specifically, the simulations yield (11.73 ± 0.17)×10⁻⁷ cm² s⁻¹ (here and in what follows we report the average

$\:{D}_{fit}$

value together with the standard deviation from

$\:{D}_{fit}$

histogram such as illustrated in Fig. 2a). This is a 5% overestimate relative to the target value of 11.20×10⁻⁷ cm² s⁻¹. Given that the diffusion coefficient is inversely proportional to the cubic root of molecular mass,

$\:D\sim{\left(m\right)}^{-1/3}$

, this effect corresponds to a roughly 15% loss in the estimated mass of the protein. Under certain circumstances the error on this scale can lead to erroneous conclusions. For example, it may be misinterpreted as the evidence of protein compaction, viz. transition from a strongly disordered state to a molten globule.
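The arithmetic behind this mass estimate is short enough to spell out. The snippet assumes nothing beyond the scaling relation $D\sim m^{-1/3}$ and the two diffusion coefficients quoted above.

```python
D_fit = 11.73e-7    # cm^2 s^-1, mean value obtained without baseline correction
D_true = 11.20e-7   # cm^2 s^-1, target value encoded in the simulation

# D ~ m^(-1/3)  =>  m_apparent / m_true = (D_fit / D_true)^(-3)
mass_ratio = (D_fit / D_true) ** -3
print(f"apparent / true mass = {mass_ratio:.3f}")
print(f"true / apparent mass = {1.0 / mass_ratio:.3f}")
```

The apparent mass comes out at about 0.87 of the true mass; equivalently, the true mass is about 15% larger than the apparent one.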

In summary, the most basic data processing strategy, which focuses on a methyl region far away from the water signal and which neglects to perform any baseline correction, yields the results that are highly reproducible (i.e. precise), but at the same time are biased (i.e. inaccurate).

As a next step, we have implemented the protocol involving the simplest baseline correction scheme – namely, 2-point baseline correction. Specifically, the baseline is approximated by a straight line connecting a pair of points,

$\:\left({\omega\:}_{left},{f}_{k}\left({\omega\:}_{left}\right)\right)$

and

$\:\left({\omega\:}_{right},{f}_{k}\left({\omega\:}_{right}\right)\right)$

. After the baseline is subtracted, the signal is numerically integrated in the same way as before.

The outcome of the Monte-Carlo analyses for this protocol is illustrated in Figs. 2d-f, showing the

$\:{D}_{fit}$

histogram, the stack of simulated STE spectra corresponding to the median

$\:{D}_{fit}$

value, and the Stejskal-Tanner fit for this particular dataset. Obviously, application of the 2-point baseline correction scheme eliminates bias in determination of the diffusion coefficient – the

$\:{D}_{fit}$

histogram is now centered at the vertical line which marks the target value of

$\:D$

. At the same time the width of the distribution is significantly increased, indicating greater uncertainty. The mean value of

$\:{D}_{fit}$

now amounts to (11.19 ± 0.57)×10⁻⁷ cm² s⁻¹, which is on target to within 0.1%, but with a three-fold increase in the uncertainty range.
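For orientation, the step that converts a series of integrals into a diffusion coefficient can be sketched as below. The sketch assumes a plain single-exponential attenuation, I = I0·exp(-D·b), where b lumps together the gyromagnetic ratio, gradient strength and timing parameters of the particular pulse sequence. The function name `stejskal_tanner_fit` and the log-linear regression are illustrative choices made here; they are not necessarily the fitting procedure used to produce the figures.

```python
import numpy as np

def stejskal_tanner_fit(b, integrals):
    """Hypothetical helper: estimate D and I0 from I = I0 * exp(-D * b).

    b         -- attenuation factors (s/cm^2 if D is wanted in cm^2/s)
    integrals -- signal integrals; must all be positive for the log
    """
    b = np.asarray(b, dtype=float)
    log_i = np.log(np.asarray(integrals, dtype=float))
    # ln I = ln I0 - D * b is a straight line in b.
    slope, intercept = np.polyfit(b, log_i, 1)
    return -slope, np.exp(intercept)
```

A log-linear fit gives extra weight to noisy low-intensity points, so a weighted or nonlinear least-squares fit may be preferable for real data.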

These observations can be readily rationalized. Indeed, 2-point baseline correction takes out the solvent contribution to the spectrum (along with a portion of protein signal), thus eliminating the bias toward fast diffusion. It is easy to see that this scheme works well only when the solvent spectrum

$\:{f}_{k}^{solvent}\left(\omega\:\right)$

within the region of interest, from

$\:{\omega\:}_{left}$

to

$\:{\omega\:}_{right}$

, can be accurately approximated by a straight line. This is indeed the case for the wing of the water signal since the selected region (from 1.33 to 0.30 ppm) corresponds to a relatively narrow band far removed from the water resonance (4.7 ppm). Generally, this strategy should work best when choosing a narrow integration region far away from strong solvent signals.

At the same time, the 2-point baseline correction scheme has a drawback in that the magnitude of the correction strongly depends on random noise contained in the two spectral points,

$\:{f}_{k}\left({\omega\:}_{left}\right)$

and

$\:{f}_{k}\left({\omega\:}_{right}\right)$

. In other words, there is no mechanism to average out noise in this simple procedure (more on this in what follows). In turn, this leads to significant random variations in the obtained

$\:{D}_{fit}$

values.

As a result, the improved data processing strategy, employing the 2-point baseline correction scheme, produces results that are accurate, but not very precise.
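The origin of this extra scatter can be made explicit with a back-of-the-envelope estimate. The symbols N, Δω, σ and B_k are introduced here for illustration and are not part of the original notation. The estimate assumes that the region contains N spectral points with spacing Δω and that the noise in different points is uncorrelated, with standard deviation σ. In real spectra, zero-filling and apodization introduce correlations between neighboring points, so the result should be read as a rough guide only. Under these assumptions, the noise in the directly integrated signal scales as

$\:{\sigma\:}_{I}\approx\:\sigma\:\,\varDelta\:\omega\:\,\sqrt{N}$

whereas the area under the 2-point baseline,

$\:{B}_{k}=\frac{1}{2}\left[{f}_{k}\left({\omega\:}_{left}\right)+{f}_{k}\left({\omega\:}_{right}\right)\right]\left|{\omega\:}_{right}-{\omega\:}_{left}\right|\approx\:\frac{1}{2}\left[{f}_{k}\left({\omega\:}_{left}\right)+{f}_{k}\left({\omega\:}_{right}\right)\right]N\varDelta\:\omega\:$

carries the noise

$\:{\sigma\:}_{B}\approx\:\frac{\sigma\:\,N\,\varDelta\:\omega\:}{\sqrt{2}}$

so that

$\:\frac{{\sigma\:}_{B}}{{\sigma\:}_{I}}\approx\:\sqrt{N/2}$

In this simplified picture the error introduced by the two edge points outweighs the noise of the integral itself as soon as the region contains more than a few independent points.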

Finally, we illustrate the performance of the scheme at the core of the DDfit algorithm, Figs. 2g-i. Same as before, the treatment begins with a simple 2-point baseline correction followed by the calculation of numeric integrals,

$\:{I}_{k}^{\left(0\right)}$

. After that the procedure takes a different course. The integrals

$\:{I}_{k}^{\left(0\right)}$

are used as weights in calculating the model spectrum

$\:{\stackrel{\sim}{f}}_{model}\left(\omega\:\right)$

with optimal signal-to-noise properties, see Eq. (1). The construct

$\:{a}_{k}+{b}_{k}\omega\:+{c}_{k}{\stackrel{\sim}{f}}_{model}\left(\omega\:\right)$

is then used to fit the individual spectra,

$\:{\stackrel{\sim}{f}}_{k}\left(\omega\:\right)$

, see Eq. (2). In this construct, the term

$\:{a}_{k}+{b}_{k}\omega\:$

absorbs the noise-induced error from the 2-point baseline correction, while the coefficient

$\:{c}_{k}$

captures the overall intensity of the protein signal, equivalent to the refined integral

$\:{I}_{k}^{\left(1\right)}$

. As can be seen from Fig. 2g, this method is bias-free and at the same time relatively insensitive to random noise in the simulated data. The mean value of

$\:{D}_{fit}$

, (11.19 ± 0.27)×10⁻⁷ cm² s⁻¹, reproduces the target, 11.20×10⁻⁷ cm² s⁻¹, perfectly well, with a two-fold decrease in uncertainty compared to the previously discussed scheme, cf. Fig. 2d.

In Fig. 2h, we focus on the simulated dataset corresponding to the median

$\:{D}_{fit}$

value. Shown in the panel is the model spectrum (thick red line), as well as the series of the fitted spectra

$\:{c}_{k}{\stackrel{\sim}{f}}_{model}\left(\omega\:\right)$

(thin colored lines). The Stejskal-Tanner fit of the so-obtained integrals, Fig. 2i, shows only a very limited amount of scatter, significantly less than previously observed in Fig. 2f.

In conclusion, in our simulated experiments the DDfit algorithm proves to be highly accurate and, at the same time, reasonably precise.
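The model-based integration described above can be sketched as follows. This is a schematic illustration rather than the DDfit source code. The function name `model_based_integrals` is hypothetical, the normalization of the weighted sum is an assumption made here (Eq. (1) should be consulted for the exact form), and the refined integral is taken to be the fitted coefficient multiplied by the area under the model spectrum. The input spectra are assumed to have already undergone 2-point baseline correction.

```python
import numpy as np

def model_based_integrals(omega, spectra, weights):
    """Hypothetical sketch of model-based integration.

    omega   -- frequency axis of the selected region, shape (N,)
    spectra -- baseline-corrected spectra, shape (K, N)
    weights -- initial integrals used as weights, shape (K,)
    """
    omega = np.asarray(omega, dtype=float)
    spectra = np.asarray(spectra, dtype=float)
    weights = np.asarray(weights, dtype=float)
    # Intensity-weighted combination of the spectra; dividing by the
    # sum of squared weights is an assumed normalization.
    f_model = weights @ spectra / np.sum(weights ** 2)
    # Linear least-squares fit of every spectrum to a + b*omega + c*f_model.
    design = np.column_stack([np.ones_like(omega), omega, f_model])
    coeffs, *_ = np.linalg.lstsq(design, spectra.T, rcond=None)
    a, b, c = coeffs  # each has shape (K,)
    # Area under the model spectrum (trapezoidal rule, axis order ignored).
    steps = np.abs(np.diff(omega))
    model_area = np.sum(0.5 * (f_model[1:] + f_model[:-1]) * steps)
    return f_model, c, c * model_area
```

Because all N points of the region contribute to the three fitted coefficients, the noise is averaged over the whole region instead of being inherited from the two edge points.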

At this point one may ask if a more sophisticated baseline correction scheme could improve the situation and, perhaps, obviate the need for the DDfit scheme. As it appears, the short answer to this question is “no”. In fact, some highly effective baseline correction schemes have been developed for NMR studies of small molecules and their mixtures – particularly, in the context of ¹³C spectroscopy (Carlos Cobas et al. 2006; Xi and Rocke 2008). In this latter case, however, the spectrum consists of a limited number of sharp signals with little or no interference from solvent; the baseline is well sampled throughout the spectrum and typically has a simple shape (characteristic baseline roll) which can be readily captured as a part of the baseline removal procedure. In contrast, ¹H spectra of proteins are cluttered with many (significantly broadened) spectral lines. The baseline is dominated by a residual water signal, which extends far up- and downfield from the H₂O resonance frequency, has a rather irregular shape and is gradient-dependent. Other major components of the buffer solution can also make a contribution to the baseline. The baseline is well sampled only at the edges of the spectrum, but cannot be directly traced between 0 and 9 ppm. All of this makes it essentially impossible to develop an effective baseline removal scheme for use in proton-based PFG NMR experiments on protein samples.

To demonstrate this point, we have tested the spline baseline correction scheme, which is one of the few more advanced options in the Bruker TopSpin repertoire. On some occasions this scheme has actually been used to process PFG NMR data from protein samples (Augé et al. 2009; Demers and Mittermaier 2009). The results of our tests are summarized in Fig. S1. In the situation where the baseline cannot be sampled well (discussed above), splines prove to be highly sensitive to noise, i.e. unstable. As a result, the spline-based procedure has the effect of amplifying noise, yielding a mean

$\:{D}_{fit}$

of (11.46 ± 2.70)×10⁻⁷ cm² s⁻¹. This is a staggering ten-fold jump in uncertainty compared to the DDfit treatment. We have also observed similar unsatisfactory performance when using other advanced baseline correction methods, such as the Whittaker smoother algorithm (Cobas 2018) available in MestReNova (Willcott 2009) (results not shown). The fundamental reason for this lack of success is the nature of protein ¹H PFG NMR spectra, as described above.

Tests using experimental data

To test the performance of DDfit relative to other data processing schemes on experimental data we have recorded 77 back-to-back DSTE experiments on a sample of 0.5 mM hen egg-white lysozyme (see Materials and Methods). Two spectral regions have been selected for signal integration: region I, extending from 2.23 to -0.05 ppm, which is the portion of the spectrum with the highest signal intensity (S/N ratio of 109), and region II, extending from -0.05 to -0.34 ppm, featuring four partially overlapped upfield-shifted methyl resonances belonging to residues Ile-98, Met-105, Leu-8 and Leu-17 (S/N ratio of 34) (Dobson et al. 1984; Redfield and Dobson 1988; Maeno et al. 2009).

Figure 3 illustrates three different signal-processing schemes as applied to the experimental datasets, spectral region I. The treatment neglecting baseline correction is clearly a failure, resulting in a distorted

$\:{I}_{k}^{\left(0\right)}\left({G}_{k}\right)$

profile, see Fig. 3c. As it happens, the data acquired at lower gradient strengths, i.e. in the presence of a sizeable residual water signal, suffer from a persistent artefact in the form of a lowered baseline (noticeable toward the right edge of Fig. 3b). This leads to underestimation of the respective integrals, causing characteristic distortions in the diffusion profile. The effect is responsible for the strong bias in the obtained

$\:{D}_{fit}$

values, see Fig. 3a, as well as their poor reproducibility,

$\:{D}_{fit}$

= (3.49 ± 0.47)×10⁻⁷ cm² s⁻¹. While in the case of (idealized) simulated data one can forego the baseline correction step with relatively little consequence, cf. Fig. 2a-c, this omission becomes a major issue when dealing with real-life data.

The situation is rescued by using a simple 2-point baseline correction scheme. The baseline-corrected spectra no longer suffer from downward (or upward) shifts, see Fig. 3e, and the

$\:{I}_{k}^{\left(0\right)}\left({G}_{k}\right)$

profiles, Fig. 3f, are free from the previously observed systematic distortions. The

$\:{D}_{fit}$

values obtained from 77 back-to-back experiments produce the expected Gaussian-shaped distribution, Fig. 3d, with the mean value of (7.64 ± 0.21)×10⁻⁷ cm² s⁻¹. This result is compatible with the available literature-based estimates. Specifically, STE results from the Arata laboratory (Price et al. 1999) (288 K, pH 3, 1.5 mM lysozyme in 90% H₂O – 10% D₂O solvent) after correction for solvent viscosity and protein concentration yield

$\:D$

≈ 6.7×10⁻⁷ cm² s⁻¹, whereas similar STE results from the Byrd laboratory (Altieri et al. 1995) (298 K, pH 2.3, 2 mM lysozyme in D₂O-based solvent) after correction for temperature and protein concentration yield

$\:D$

≈ 8.3×10⁻⁷ cm² s⁻¹. The two estimates, therefore, straddle the result obtained in this study.

Fig. 3

Analyses of the experimental DSTE data from the sample of HEWL using different processing schemes. Top row: no baseline correction, direct numeric integration of the spectra in the specified region, from 2.23 to -0.05 ppm (region I). (A) Histogram of

$\:{D}_{fit}$

values obtained from Stejskal-Tanner-Jerschow-Muller analyses of 77 replicate datasets. (B) Representative dataset corresponding to the median

$\:{D}_{fit}$

value. (C) Stejskal-Tanner-Jerschow-Muller fit of the integral signals from the dataset shown in panel (B). Middle row: 2-point baseline correction, direct numeric integration of the spectra. (D) Histogram of

$\:{D}_{fit}$

values. (E) Representative dataset corresponding to the median

$\:{D}_{fit}$

value. (F) Stejskal-Tanner-Jerschow-Muller fit of the integral signals

$\:{I}_{k}^{\left(0\right)}$

from the dataset (E). Bottom row: 2-point baseline correction, model-based integration as implemented in DDfit. (G) Histogram of

$\:{D}_{fit}$

values. (H) Representative dataset corresponding to the median

$\:{D}_{fit}$

value, after DDfit treatment. Shown is the model spectrum

$\:{\stackrel{\sim}{f}}_{model}\left(\omega\:\right)$

(thick red line) and the series of fitted spectra

$\:{c}_{k}{\stackrel{\sim}{f}}_{model}\left(\omega\:\right)$

(thin colored lines). (I) Stejskal-Tanner-Jerschow-Muller fit of the integral signals

$\:{I}_{k}^{\left(1\right)}$

obtained from the dataset (H).

The situation is further improved by using our DDfit processing scheme, Fig. 3g-i. In this case, the

$\:{D}_{fit}$

values determined from 77 consecutive DSTE experiments prove to be highly reproducible, with the average of (7.65 ± 0.05)×10⁻⁷ cm² s⁻¹. Thus, the use of the DDfit scheme leads to a 4-fold reduction in error margin compared to the conventional processing strategy, i.e. 2-point baseline correction followed by signal integration. Reiterating our conclusion from the analyses of the simulated data, we observe that the DDfit algorithm leads to both accurate and precise results also in the case of experimentally measured data.

As a side note, we also tested an alternative baseline correction scheme. Same as in the previous section, we used splines for baseline correction followed by direct integration of the signal in the (

$\:{\omega\:}_{left},{\omega\:}_{right}$

) interval. While the spline-approximated baseline for a given (high-intensity) spectrum is visually reasonable, see Fig. S2, the scheme is unstable and, as already pointed out, leads to noise amplification. The mean

$\:{D}_{fit}$

value from this method is (7.72 ± 0.60)×10⁻⁷ cm² s⁻¹, signifying a 12-fold increase in the magnitude of error compared to the DDfit scheme.

Finally, we have performed the same series of tests focusing on region II in the experimental spectra, where the signal is 25-fold weaker than in region I (referring to the integral intensities after 2-point baseline correction). The results are summarized in Fig. S3. In brief, the treatment neglecting baseline correction is a complete failure (compromised by variable baseline shifts as discussed above). The problem goes away when baseline correction is applied. The scheme using simple 2-point baseline correction followed by direct integration yields

$\:{D}_{fit}$

= (7.52 ± 0.69)×10⁻⁷ cm² s⁻¹, whereas the DDfit treatment using 2-point baseline correction followed by model-based integration yields

$\:{D}_{fit}$

= (7.43 ± 0.29)×10⁻⁷ cm² s⁻¹. The results are consistent with each other and also with those previously obtained from region I (see above). Once again, the precision of the DDfit algorithm is considerably better than that of the standard data processing scheme, with a 2.4-fold improvement in the confidence interval.

4. Concluding remarks

Model-based integration of spectral signals is generally a powerful approach, utilizing certain prior information about the characteristics of these signals. For example, the popular NMRPipe package (Delaglio et al. 1995) includes options to treat a stack of 2D spectra, assuming that a number of parameters for a selected spectral peak are given and are reproduced across the series of spectra (e.g. ¹H^N chemical shift, ¹⁵N chemical shift, and the peak shape). Furthermore, certain assumptions can usually be made about the peak shape, e.g. in solution-state spectra the shapes are typically Lorentzian or, otherwise, can be described by a convolution of a Lorentzian contour with the Fourier transform of the (known) window function (Van Horn et al. 2010). The NMRPipe routine that utilizes such prior information has been widely used to process data from relaxation experiments, J-coupling and residual dipolar coupling measurements, exchange measurements and other types of experiments (Xue et al. 2014; Wang et al. 2021; Yuwen et al. 2018; Maltsev et al. 2012; Ward and Skrynnikov 2012; Chevelkov et al. 2010; Wylie et al. 2011).

The distinctive feature of the protein diffusion experiments, as discussed in this paper, is that there is no a priori model for the spectral signal of interest (such as a Lorentzian function for an isolated spectral peak). Instead, we deal with a sample-specific spectral pattern

$\:f\left(\omega\:\right)$

, which consists of many overlapped spectral lines that fall within the region of interest, (

$\:{\omega\:}_{left},{\omega\:}_{right}$

). To address this problem, we have developed a recipe to calculate

$\:{f}_{model}\left(\omega\:\right)$

as an intensity-weighted sum of the gradient-encoded

$\:{f}_{k}\left(\omega\:\right)$

spectra. This recipe is rooted in the optimal filtration theory and ensures the best quality of the model. Armed with

$\:{f}_{model}\left(\omega\:\right)$

, the algorithm performs a model-based integration of

$\:{f}_{k}\left(\omega\:\right)$

spectra, arriving at uniquely precise results.

The obvious assumption here is that the spectral shape remains unchanged, up to a scaling factor, across the series of diffusion spectra

$\:{f}_{k}\left(\omega\:\right)$

. This is certainly justified for samples that contain only one sort of protein species. This assumption, however, becomes questionable for mixtures (e.g. fully developed monomer-dimer equilibrium) and ought to be tested before our protocol is used.

One should also bear in mind that any processing scheme for PFG NMR diffusion data should necessarily include the baseline correction step. We have investigated this aspect in some detail and found that the simple 2-point baseline correction works best for both standard processing schemes and our new algorithm. In contrast, more sophisticated variants, such as spline-based corrections, lead to significant noise amplification.

With these caveats in mind, our algorithm offers an efficient tool to analyze diffusion data from proteins and other large biomolecules. The method has been implemented in the form of a web server, DDfit, accessible at https://ddfit.bio-nmr.spbu.ru/. The server is easy to use for standard Bruker-based experiments as well as custom-built sequences, with calculations taking no more than several seconds. We anticipate that this user-friendly web application, which determines protein diffusion coefficients with much better precision than other existing programmatic solutions, will become a useful addition to the ever-expanding bio-NMR toolchest.

Supplementary Information

The online version contains supplementary material available at https://...

Acknowledgement

The study has used the facilities of the Center for Magnetic Resonance (where we would like to acknowledge the assistance of M.A. Vovk and A.S. Mazur), Center for Chemical Analysis & Materials Research, as well as Computing Center at SPbU. N.R.S. and I.S.P. would like to express their shared strong wish for peace in Ukraine.

Author Contribution

V.A.S. conducted experimental measurements, designed and implemented the web server, and wrote the initial draft of the manuscript. O.O.L. maintained the web server and implemented a number of enhancements. N.R.S. supervised the project and wrote the manuscript. I.S.P. devised the algorithm, conducted experimental measurements and processed the data, designed and implemented the simulations, prepared the graphics and took part in writing the manuscript.

Funding

This work was supported by the grant 127400408 from SPbU.

Data Availability

The DDfit server offers full access to the dataset from one of the DSTE experiments described in the paper (under the Demo tab). All other experimental and simulated data are available from the authors upon request.

Declarations

Competing interests

The authors declare no competing interests.

References

Altieri AS, Hinton DP, Byrd RA (1995) Association of biomolecular systems via pulsed-field gradient NMR self-diffusion measurements. J Am Chem Soc 117:7566–7567

Augé S, Schmit P-O, Crutchfield CA, Islam MT, Harris DJ, Durand E, Clemancey M, Quoineaud A-A, Lancelin J-M, Prigent Y, Taulelle F, Delsuc M-A (2009) NMR Measure of Translational Diffusion and Fractal Dimension. Application to Molecular Mass Measurement. J Phys Chem B 113:1914–1918. http://dx.doi.org/10.1021/jp8094424

Augustyniak R, Ferrage F, Damblon C, Bodenhausen G, Pelupessy P (2012) Efficient determination of diffusion coefficients by monitoring transport during recovery delays in NMR. Chem Commun 48:5307–5309. http://dx.doi.org/10.1039/c2cc30578j

Augustyniak R, Ferrage F, Paquin R, Lequin O, Bodenhausen G (2011) Methods to determine slow diffusion coefficients of biomolecules. Applications to Engrailed 2, a partially disordered protein. J Biomol NMR 50:209–218. http://dx.doi.org/10.1007/s10858-011-9510-8

Balbach J (2000) Compaction during protein folding studied by real-time NMR diffusion experiments. J Am Chem Soc 122:5887–5888. http://dx.doi.org/10.1021/ja994514d

Baldwin AJ, Hilton GR, Lioe H, Bagneris C, Benesch JLP, Kay LE (2011) Quaternary Dynamics of alpha B-Crystallin as a Direct Consequence of Localised Tertiary Fluctuations in the C-Terminus. J Mol Biol 413:310–320. http://dx.doi.org/10.1016/j.jmb.2011.07.017

Barhoum S, Yethiraj A (2010) NMR Detection of an Equilibrium Phase Consisting of Monomers and Clusters in Concentrated Lysozyme Solutions. J Phys Chem B 114:17062–17067. http://dx.doi.org/10.1021/jp108995k

Brady JP, Farber PJ, Sekhar A, Lin Y-H, Huang R, Bah A, Nott TJ, Chan HS, Baldwin AJ, Forman-Kay JD, Kay LE (2017) Structural and hydrodynamic properties of an intrinsically disordered region of a germ cell-specific protein on phase separation. Proc Natl Acad Sci USA 114:E8194-E8203. http://dx.doi.org/10.1073/pnas.1706197114

Bruker (2020) TopSpin. User Manual. Version 012. Bruker Corporation, Billerica MA, USA

Buevich AV, Baum J (2002) Residue-specific real-time NMR diffusion experiments define the association states of proteins during folding. J Am Chem Soc 124:7156–7162. http://dx.doi.org/10.1021/ja012699u

Carlos Cobas J, Bernstein MA, Martín-Pastor M, Tahoces PG (2006) A new general-purpose fully automatic baseline-correction procedure for 1D and 2D NMR data. J Magn Reson 183:145–151. http://dx.doi.org/10.1016/j.jmr.2006.07.013

Chan SHS, Waudby CA, Cassaignau AME, Cabrita LD, Christodoulou J (2015) Increasing the sensitivity of NMR diffusion measurements by paramagnetic longitudinal relaxation enhancement, with application to ribosome-nascent chain complexes. J Biomol NMR 63:151–163. http://dx.doi.org/10.1007/s10858-015-9968-x

Chevelkov V, Xue Y, Rao DK, Forman-Kay JD, Skrynnikov NR (2010) ¹⁵N^H/D-SOLEXSY experiment for accurate measurement of amide solvent exchange rates. Application to denatured drkN SH3. J Biomol NMR 46:227–244

Choy WY, Mulder FAA, Crowhurst KA, Muhandiram DR, Millett IS, Doniach S, Forman-Kay JD, Kay LE (2002) Distribution of molecular size within an unfolded state ensemble using small-angle X-ray scattering and pulse field gradient NMR techniques. J Mol Biol 316:101–112. http://dx.doi.org/10.1006/jmbi.2001.5328

Christodoulou J, Larsson G, Fucini P, Connell SR, Pertinhez TA, Hanson CL, Redfield C, Nierhaus KH, Robinson CV, Schleucher J, Dobson CM (2004) Heteronuclear NMR investigations of dynamic regions of intact Escherichia coli ribosomes. Proc Natl Acad Sci USA 101:10949–10954. http://dx.doi.org/10.1073/pnas.0400928101

Cobas C (2018) Applications of the Whittaker smoother in NMR spectroscopy. Magn Reson Chem 56:1140–1148. http://dx.doi.org/10.1002/mrc.4747

Coffman JL, Lightfoot EN, Root TW (1997) Protein diffusion in porous chromatographic media studied by proton and fluorine PFG-NMR. J Phys Chem B 101:2218–2223. http://dx.doi.org/10.1021/jp962585i

Dalvit C, Vulpetti A (2012) Technical and practical aspects of 19F NMR-based screening: toward sensitive high-throughput screening with rapid deconvolution. Magn Reson Chem 50:592–597. http://dx.doi.org/10.1002/mrc.3842

Delaglio F, Grzesiek S, Vuister GW, Zhu G, Pfeifer J, Bax A (1995) NMRPipe: a multidimensional spectral processing system based on unix pipes. J Biomol NMR 6:277–293

Demers JP, Mittermaier A (2009) Binding mechanism of an SH3 domain studied by NMR and ITC. J Am Chem Soc 131:4355–4367

Didenko T, Boelens R, Rudiger SGD (2011) 3D DOSY-TROSY to determine the translational diffusion coefficient of large protein complexes. Protein Eng Des Sel 24:99–103. http://dx.doi.org/10.1093/protein/gzq091

Dobson CM, Evans PA, Williamson KL (1984) Proton NMR studies of denatured lysozyme. FEBS Lett 168:331–334. http://dx.doi.org/10.1016/0014-5793(84)80273-2

Edwards LJ, Savostyanov DV, Welderufael ZT, Lee D, Kuprov I (2014) Quantum mechanical NMR simulation algorithm for protein-size spin systems. J Magn Reson 243:107–113. http://dx.doi.org/10.1016/j.jmr.2014.04.002

Ernst RR, Bodenhausen G, Wokaun A (1987) Principles of nuclear magnetic resonance in one and two dimensions. Oxford University Press, Oxford

Ferrage F, Zoonens M, Warschawski DE, Popot J-L, Bodenhausen G (2003) Slow Diffusion of Macromolecular Assemblies by a New Pulsed Field Gradient NMR Method. J Am Chem Soc 125:2541–2545. http://dx.doi.org/10.1021/ja0211407

Grage H, Akke M (2003) A statistical analysis of NMR spectrometer noise. J Magn Reson 162:176–188. http://dx.doi.org/10.1016/S1090-7807(03)00038-7

Hansen DF, Yang D, Feng H, Zhou Z, Wiesner S, Bai Y, Kay LE (2007) An Exchange-Free Measure of 15N Transverse Relaxation: An NMR Spectroscopy Application to the Study of a Folding Intermediate with Pervasive Chemical Exchange. J Am Chem Soc 129:11468–11479. http://dx.doi.org/10.1021/ja072717t

Helmus JJ, Jaroniec CP (2013) Nmrglue: an open source Python package for the analysis of multidimensional NMR data. J Biomol NMR 55:355–367. http://dx.doi.org/10.1007/s10858-013-9718-x

Hoffmann ARF, Caillon L, Salazar Vazquez LS, Spath PA, Carlier L, Khemtémourian L, Lequin O (2018) Time dependence of NMR observables reveals salient differences in the accumulation of early aggregated species between human islet amyloid polypeptide and amyloid-β. Phys Chem Chem Phys 20:9561–9573. http://dx.doi.org/10.1039/c7cp07516b

Horst R, Horwich AL, Wuethrich K (2011) Translational Diffusion of Macromolecular Assemblies Measured Using Transverse-Relaxation-Optimized Pulsed Field Gradient NMR. J Am Chem Soc 133:16354–16357. http://dx.doi.org/10.1021/ja206531c

Huang R, Brady JP, Sekhar A, Yuwen T, Kay LE (2017) An enhanced sensitivity methyl H-1 triple-quantum pulse scheme for measuring diffusion constants of macromolecules. J Biomol NMR 68:249–255. http://dx.doi.org/10.1007/s10858-017-0122-9

Huang R, Ripstein ZA, Augustyniak R, Lazniewski M, Ginalski K, Kay LE, Rubinstein JL (2016) Unfolding the mechanism of the AAA plus unfoldase VAT by a combined cryo-EM, solution NMR study. Proc Natl Acad Sci USA 113:E4190-E4199. http://dx.doi.org/10.1073/pnas.1603980113

Ilyina E, Roongta V, Pan H, Woodward C, Mayo KH (1997) A pulsed-field gradient NMR study of bovine pancreatic trypsin inhibitor self-association. Biochemistry 36:3383–3388. http://dx.doi.org/10.1021/bi9622229

Jerschow A, Muller N (1997) Suppression of convection artifacts in stimulated-echo diffusion experiments. Double-stimulated-echo experiments. J Magn Reson 125:372–375. http://dx.doi.org/10.1006/jmre.1997.1123

Kato H, Gruschus J, Ghirlando R, Tjandra N, Bai Y (2009) Characterization of the N-Terminal Tail Domain of Histone H3 in Condensed Nucleosome Arrays by Hydrogen Exchange and NMR. J Am Chem Soc 131:15104–15105. http://dx.doi.org/10.1021/ja9070078

Kharkov BB, Podkorytov IS, Bondarev SA, Belousov MV, Salikov VA, Zhouravleva GA, Skrynnikov NR (2021) The role of rotational motion in diffusion NMR experiments on supramolecular assemblies: application to Sup35NM fibrils. Angew Chem Int Ed 60:15445–15451. http://dx.doi.org/10.1002/anie.202102408

Kheddo P, Cliff MJ, Uddin S, van der Walle CF, Golovanov AP (2016) Characterizing monoclonal antibody formulations in arginine glutamate solutions using H-1 NMR spectroscopy. Mabs 8:1245–1258. http://dx.doi.org/10.1080/19420862.2016.1214786

Kitevski-LeBlanc JL, Yuwen T, Dyer PN, Rudolph J, Luger K, Kay LE (2018) Investigating the dynamics of destabilized nucleosomes using methyl-TROSY NMR. J Am Chem Soc 140:4774–4777. http://dx.doi.org/10.1021/jacs.8b00931

Leung RLC, Robinson MDM, Ajabali AAA, Karunanithy G, Lyons B, Raj R, Raoufmoghaddam S, Mohammed S, Claridge TDW, Baldwin AJ, Davis BG (2017) Monitoring the Disassembly of Virus-like Particles by F-19-NMR. J Am Chem Soc 139:5277–5280. http://dx.doi.org/10.1021/jacs.6b11040

Li Y, Shan B, Raleigh DP (2007) The cold denatured state is compact but expands at low temperatures: Hydrodynamic properties of the cold denatured state of the C-terminal domain of L9. J Mol Biol 368:256–262. http://dx.doi.org/10.1016/j.jmb.2007.02.011

Li YJ, Kim S, Brodsky B, Baum J (2005) Identification of partially disordered peptide intermediates through residue-specific NMR diffusion measurements. J Am Chem Soc 127:10490–10491. http://dx.doi.org/10.1021/ja052801d

Liu Z, Zhang WP, Xing Q, Ren X, Liu M, Tang C (2012) Noncovalent dimerization of ubiquitin. Angew Chem Int Ed 51:469–472. http://dx.doi.org/10.1002/anie.201106190

Lucas LH, Price KE, Larive CK (2004) Epitope mapping and competitive binding of HSA drug site II ligands by NMR diffusion measurements. J Am Chem Soc 126:14258–14266. http://dx.doi.org/10.1021/ja0479538

Maeno A, Matsuo H, Akasaka K (2009) The pressure-temperature phase diagram of hen lysozyme at low pH. Biophysics (Nagoya-shi) 5:1–9. http://dx.doi.org/10.2142/biophysics.5.1

Maltsev AS, Ying J, Bax A (2012) Impact of N-Terminal Acetylation of α-Synuclein on Its Random Coil and Lipid Binding Properties. Biochemistry 51:5004–5013. http://dx.doi.org/10.1021/bi300642h

Mariño L, Pauwels K, Casasnovas R, Sanchis P, Vilanova B, Muñoz F, Donoso J, Adrover M (2015) Ortho-methylated 3-hydroxypyridines hinder hen egg-white lysozyme fibrillogenesis. Sci Rep 5:12052. http://dx.doi.org/10.1038/srep12052

Mestrelab (2024) MestReNova Manual (version 15.1.0). Mestrelab Research, Santiago de Compostela, Spain

Narayanan S, Reif B (2005) Characterization of chemical exchange between soluble and aggregated states of beta-amyloid by solution-state NMR upon variation of salt conditions. Biochemistry 44:1444–1452. http://dx.doi.org/10.1021/bi048264b

Peebles P (2000) Probability, Random Variables, and Random Signal Principles. McGraw-Hill Singapore

Price WS, Tsuchiya F, Arata Y (1999) Lysozyme aggregation and solution properties studied using PGSE NMR diffusion measurements. J Am Chem Soc 121:11503–11512. http://dx.doi.org/10.1021/ja992265n

Rabdano SO, Izmailov SA, Luzik DA, Groves A, Podkorytov IS, Skrynnikov NR (2017) Onset of disorder and protein aggregation due to oxidation-induced intermolecular disulfide bonds: case study of RRM2 domain from TDP-43. Sci Rep 7. http://dx.doi.org/10.1038/s41598-017-10574-w

Ramanujam V, Alderson TR, Pritisanac I, Ying J, Bax A (2020) Protein structural changes characterized by high-pressure, pulsed field gradient diffusion NMR spectroscopy. J Magn Reson 312. http://dx.doi.org/10.1016/j.jmr.2020.106701

Redfield C, Dobson CM (1988) Sequential 1H NMR assignments and secondary structure of hen egg white lysozyme in solution. Biochemistry 27:122–136. http://dx.doi.org/10.1021/bi00401a020

Roos M, Ott M, Hofmann M, Link S, Rössler E, Balbach J, Krushelnitsky A, Saalwächter K (2016) Coupling and Decoupling of Rotational and Translational Diffusion of Proteins under Crowding Conditions. J Am Chem Soc 138:10365–10372. http://dx.doi.org/10.1021/jacs.6b06615

Segev O, Columbus I, Ashani Y, Cohen Y (2005) Probing the molecular interaction of chymotrypsin with organophosphorus compounds by P-31 diffusion NMR in aqueous solutions. J Org Chem 70:309–314. http://dx.doi.org/10.1021/jo0485942

Soong R, Brender JR, Macdonald PM, Ramamoorthy A (2009) Association of Highly Compact Type II Diabetes Related Islet Amyloid Polypeptide Intermediate Species at Physiological Temperature Revealed by Diffusion NMR Spectroscopy. J Am Chem Soc 131:7079–7085. http://dx.doi.org/10.1021/ja900285z

Stefaniuk A, Gawinkowski S, Golec B, Gorski A, Szutkowski K, Waluk J, Poznański J (2022) Isotope effects observed in diluted D2O/H2O mixtures identify HOD-induced low-density structures in D2O but not H2O. Sci Rep 12:18732. http://dx.doi.org/10.1038/s41598-022-23551-9

Stejskal EO, Tanner JE (1965) Spin diffusion measurements: spin echoes in the presence of a time-dependent field gradient. J Chem Phys 42:288–292. http://dx.doi.org/10.1063/1.1695690

Svane ASP, Jahn K, Deva T, Malmendal A, Otzen DE, Dittmer J, Nielsen NC (2008) Early stages of amyloid fibril formation studied by liquid-state NMR: The peptide hormone glucagon. Biophys J 95:366–377. http://dx.doi.org/10.1529/biophysj.107.122895

Tillett ML, Horsfield MA, Lian LY, Norwood TJ (1999) Protein-ligand interactions measured by N-15-filtered diffusion experiments. J Biomol NMR 13:223–232. http://dx.doi.org/10.1023/a:1008301324954

Tseng BP, Esler WP, Clish CB, Stimson ER, Ghilardi JR, Vinters HV, Mantyh PW, Lee JP, Maggio JE (1999) Deposition of monomeric, not oligomeric, A beta mediates growth of Alzheimer's disease amyloid plaques in human brain preparations. Biochemistry 38:10424–10431. http://dx.doi.org/10.1021/bi990718v

Van Horn WD, Beel AJ, Kang C, Sanders CR (2010) The impact of window functions on NMR-based paramagnetic relaxation enhancement measurements in membrane proteins. Biochim Biophys Acta Biomembr 1798:140–149. http://dx.doi.org/10.1016/j.bbamem.2009.08.022

Waelti MA, Orts J, Voegeli B, Campioni S, Riek R (2015) Solution NMR Studies of Recombinant A beta(1–42): From the Presence of a Micellar Entity to Residual beta-Sheet Structure in the Soluble Species. ChemBioChem 16:659–669. http://dx.doi.org/10.1002/cbic.201402595

Wang Y, Han G, Jiang X, Yuwen T, Xue Y (2021) Chemical shift prediction of RNA imino groups: application toward characterizing RNA excited states. Nat Commun 12:1595. http://dx.doi.org/10.1038/s41467-021-21840-x

Wang Y, Li C, Pielak GJ (2010) Effects of Proteins on Protein Diffusion. J Am Chem Soc 132:9392–9397. http://dx.doi.org/10.1021/ja102296k

Ward JM, Skrynnikov NR (2012) Very large residual dipolar couplings from deuterated ubiquitin. J Biomol NMR 54:53–67. http://dx.doi.org/10.1007/s10858-012-9651-4

Waudby CA, Mantle MD, Cabrita LD, Gladden LF, Dobson CM, Christodoulou J (2012) Rapid Distinction of Intracellular and Extracellular Proteins Using NMR Diffusion Measurements. J Am Chem Soc 134:11312–11315. http://dx.doi.org/10.1021/ja304912c

Weljie AM, Yamniuk AP, Yoshino H, Izumi Y, Vogel HJ (2003) Protein conformational changes studied by diffusion NMR spectroscopy: Application to helix-loop-helix calcium binding proteins. Protein Sci 12:228–236. http://dx.doi.org/10.1110/ps.0226203

Wilkins DK, Grimshaw SB, Receveur V, Dobson CM, Jones JA, Smith LJ (1999) Hydrodynamic radii of native and denatured proteins measured by pulse field gradient NMR techniques. Biochemistry 38:16424–16431

Willcott MR (2009) MestRe Nova. J Am Chem Soc 131:13180. http://dx.doi.org/10.1021/ja906709t

Wong LE, Kim TH, Muhandiram DR, Forman-Kay JD, Kay LE (2020) NMR Experiments for Studies of Dilute and Condensed Protein Phases: Application to the Phase-Separating Protein CAPRIN1. J Am Chem Soc 142:2471–2489. http://dx.doi.org/10.1021/jacs.9b12208

Wylie BJ, Sperling LJ, Nieuwkoop AJ, Franks WT, Oldfield E, Rienstra CM (2011) Ultrahigh resolution protein structures using NMR chemical shift tensors. Proc Natl Acad Sci USA 108:16974–16979. http://dx.doi.org/10.1073/pnas.1103728108

Xi Y, Rocke DM (2008) Baseline Correction for NMR Spectroscopic Metabolomics Data Analysis. BMC Bioinformatics 9:324. http://dx.doi.org/10.1186/1471-2105-9-324

Xue Y, Yuwen TR, Zhu FQ, Skrynnikov NR (2014) Role of electrostatic interactions in binding of peptides and intrinsically disordered proteins to their folded targets. 1. NMR and MD characterization of the complex between the c-Crk N-SH3 domain and the peptide Sos. Biochemistry 53:6473–6495. http://dx.doi.org/10.1021/bi500904f

Yuwen T, Brady JP, Kay LE (2018) Probing Conformational Exchange in Weakly Interacting, Slowly Exchanging Protein Systems via Off-Resonance R1ρ Experiments: Application to Studies of Protein Phase Separation. J Am Chem Soc 140:2115–2126. http://dx.doi.org/10.1021/jacs.7b09576

Zick K (2016) Diffusion NMR user manual (version 004). Bruker Biospin GmbH, Rheinstetten, Germany

