The Zung Self-Rating Depression Scale (SDS): structure and interpretation
Why another depression scale matters
In Russian-speaking clinical practice, "the depression scale" usually means BDI-II or PHQ-9. The Zung Self-Rating Depression Scale (William W.K. Zung) is a third valid instrument of the same class — just less visible. Not "worse" — under-recognised. SDS was published in 1965, 31 years before BDI-II and 36 years before PHQ-9, and across six decades it has accumulated a psychometric record comparable to any modern alternative.
SDS has one structural property absent from BDI-II, PHQ-9, and HAM-D: half its items are positively worded ("I eat as much as I used to", "I find it easy to think clearly") and reverse-scored. Zung's intent was direct: if every question is phrased in the same direction, respondents may agree with everything out of inertia — a phenomenon known as acquiescence bias. Wording balance is a structural antidote. And this is not design rhetoric: factor analyses by Kawada & Suzuki (1993, N=3,178) and the meta-analysis by Shafer (2006, N=12,621 across 13 studies) confirm that SDS positive items load on a distinct Positive Affect factor — absent from BDI-II and PHQ-9.
SDS includes 10 positively worded items that are reverse-scored. BDI-II contains zero. PHQ-9 contains zero. CES-D contains four. This balance is SDS's primary methodological signature, confirmed by factor analysis rather than merely declared by the original author.
Structure: 20 items and four factors
SDS consists of 20 items, each rated on a 4-point frequency scale from 1 ("a little of the time") to 4 ("most of the time"). The reference period is the past two weeks, same as PHQ-9 and BDI-II. After the positively worded items are reverse-scored, the raw total ranges from 20 to 80. Internal consistency in modern validations is high: Cronbach's α = 0.86 in Dunstan & Scott (2017), 0.89 in Ruiz-Grosso (2012), 0.87 in Mammadova (2012).
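A minimal sketch of the raw scoring described above, in Python. The set of reverse-scored item numbers is an assumption taken from the commonly reproduced English form and is not stated in this article; check it against the form you actually administer. The function name `sds_raw_score` is hypothetical.

```python
# Assumption: positively worded item numbers in the commonly reproduced
# English SDS form. Verify against the form you administer.
REVERSED_ITEMS = frozenset({2, 5, 6, 11, 12, 14, 16, 17, 18, 20})


def sds_raw_score(responses: dict[int, int]) -> int:
    """Sum 20 responses (each 1-4), reverse-scoring the positively worded items."""
    if set(responses) != set(range(1, 21)):
        raise ValueError("expected responses for items 1..20")
    total = 0
    for item, value in responses.items():
        if value not in (1, 2, 3, 4):
            raise ValueError(f"item {item}: response must be 1-4, got {value}")
        # Reverse scoring maps 1->4, 2->3, 3->2, 4->1.
        total += (5 - value) if item in REVERSED_ITEMS else value
    return total
```

Under this scoring, a respondent who answers "most of the time" to every item lands at a raw sum of 50, the midpoint of the 20–80 range, rather than at either extreme. That is the wording balance at work.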
Zung organised the items around four dimensions of depression. The factor structure has held up in subsequent verifications: Romera et al. (2008), in 1,049 primary-care patients with MDD, recovered four factors with congruence coefficients of 0.87–0.98, and the Shafer 2006 meta-analysis confirmed factor stability across more than 12,000 aggregated respondents.
- Pervasive affect: overall affective tone (depressed mood, crying spells, diurnal variation)
- Physiological equivalents: somatic symptoms (sleep, appetite, libido, fatigue, psychomotor changes)
- Psychological equivalents: cognitive and motivational disturbances (hopelessness, indecisiveness, feelings of emptiness, thoughts of death)
- Psychomotor activities: psychomotor slowing or agitation (retardation, restlessness, irritability)
"The specific depression symptom factors within each test appear to be relatively robust and well established and match fairly closely previously hypothesized factor structures."— Shafer A.B., Journal of Clinical Psychology, 2006, meta-analysis of BDI / CES-D / HRSD / SDS factor structures
Index score and interpretation thresholds
The defining technical feature of SDS is the conversion of the raw score to an index. The raw sum (20–80) is multiplied by 100 and divided by 80 (index = raw × 100 / 80), producing an index score on a 25–100 scale: a standardised metric convenient for cross-study comparison and for interpretation as a percentage of the theoretical maximum. Zung's classic thresholds on the index: <50 normal; 50–59 mild; 60–69 moderate; ≥70 severe.
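The conversion and the classic bands can be sketched in a few lines of Python. The function names `sds_index` and `zung_band` are hypothetical. The index is compared unrounded, which gives the same band as rounding to the nearest whole number because raw steps of 1 move the index in steps of 1.25.

```python
def sds_index(raw: int) -> float:
    """Convert a raw SDS sum (20-80) to the 25-100 index."""
    if not 20 <= raw <= 80:
        raise ValueError(f"raw sum must be within 20-80, got {raw}")
    return raw * 100 / 80


def zung_band(index: float) -> str:
    """Map an index score to Zung's classic severity bands."""
    if index < 50:
        return "normal"
    if index < 60:
        return "mild"
    if index < 70:
        return "moderate"
    return "severe"
```

For example, a raw sum of 40 converts to an index of exactly 50 and falls in the "mild" band, while a raw sum of 39 (index 48.75) stays in "normal".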
Here lies the most common methodological error with SDS: confusing raw and index scores. A raw sum of 50 corresponds to an index of 62.5, so applying a cutoff of ≥50 to the raw 20–80 sum (instead of the 25–100 index) demands a substantially heavier symptom load to screen positive. Dunstan, Scott & Todd (2017) measured the cost: sensitivity drops from 93% to 56%, meaning roughly 44% of depressive cases slip through. The fix is mechanical: always convert the sum to the index using (raw / 80) × 100 before comparing to any index-based cutoff.
Zung's standard cutoff is index ≥50, not raw ≥50. They are different scales, and substituting one for the other nullifies screening validity. If you are using SDS, build automatic conversion into the instrument formula or platform spec. Index ≥50 does not indicate a diagnosis of major depressive disorder — it indicates that clinical attention is warranted.
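One way to build the conversion into a platform is to make the scale part of the score itself, so a bare number can never be compared to a cutoff. The sketch below is illustrative only; `Scale`, `SdsScore`, and `screens_positive` are hypothetical names, not part of any existing tool.

```python
from dataclasses import dataclass
from enum import Enum


class Scale(Enum):
    RAW = "raw"      # 20-80 sum of item scores
    INDEX = "index"  # 25-100 index, raw * 100 / 80


@dataclass(frozen=True)
class SdsScore:
    value: float
    scale: Scale

    def as_index(self) -> float:
        """Return the score on the index scale, converting if it is raw."""
        if self.scale is Scale.RAW:
            return self.value * 100 / 80
        return self.value


def screens_positive(score: SdsScore, index_cutoff: float = 50.0) -> bool:
    """Compare on the index scale only; raw scores are converted first."""
    return score.as_index() >= index_cutoff
```

A raw sum of 45 illustrates the point: it is below 50 as a bare number, but it converts to an index of 56.25 and screens positive against the standard cutoff.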
SDS vs BDI vs PHQ-9: choosing the right tool
All three scales are valid and widely used. The choice is a function of the task, not prestige. BDI-II is built on Beck's cognitive-affective model, contains 21 items, and is more often applied to assess severity in already-identified patients; it is a commercial Pearson product with licensing requirements. PHQ-9 has 9 items, maps directly to the DSM-5 criteria for major depressive disorder, is in the public domain, and is the de facto standard in primary care and measurement-based care (MBC). SDS has 20 items with balanced wording, is free of charge, is oriented to current state without rigid attachment to any single diagnostic system, and carries a long historical record of compatible data.
- Choose PHQ-9 if you need a brief primary-care screen with DSM-5 alignment and a built-in suicide ideation item
- Choose BDI-II if you are rating severity in an already-diagnosed patient and working within a cognitive-behavioral framework
- Choose SDS if you are running epidemiological screening, conducting cross-cultural research, or the patient has historical SDS data from prior care
- Choose HAM-D if you need clinician-rated rather than self-report measurement
Where SDS shines
Epidemiological research. Balanced wording makes SDS more resistant to response style bias. This matters when comparing groups with different cultural norms around "agreeing with authority", and in large-population screens where acquiescence can systematically distort estimates.
Cross-cultural research. SDS has been psychometrically validated in more than a dozen languages indexed in the international literature, including English, Spanish, Brazilian Portuguese, Chinese, Japanese, Arabic, Italian, Finnish, Persian, Dutch, German, and Azerbaijani. When the task is to compare depressive symptomatology across countries over a long timespan, SDS provides retrospective compatibility that neither PHQ-9 nor BDI-II can match.
Geriatrics and somatic medicine, with caveats. Somatic items (sleep, appetite, libido, fatigue) can produce false positives in elderly and chronically ill patients. This is a documented limitation, and it calls for cutoff adjustment, not abandonment of the tool. Jokelainen et al. (2019) showed in 520 Finnish adults aged 72–73 that the optimal cutoff in this population is lower than the standard, with sensitivity 79%, specificity 72%, and AUC 0.85, statistically indistinguishable from BDI-21. Adjustment can also go the other way: in Parkinson's disease (Chagas et al. 2009, Brazil), the cutoff was raised to index ≥55 to compensate for motor symptoms, yielding sensitivity 89% and specificity 83%.
Limitations and caveats
Somatic items are the principal source of systematic error, as outlined above. In an elderly patient with arthritis and disturbed sleep, the four somatic items can lift the index by 10–15 points independent of mood. The fix: triangulate SDS against another instrument (PHQ-9, HAM-D, or clinical interview) and read trajectory rather than a single point.
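The 10–15 point figure is consistent with simple arithmetic on the index formula. The sketch below assumes, for illustration only, that each of the four somatic items is rated 2 or 3 points higher than it would be on mood alone; `index_shift` is a hypothetical helper.

```python
def index_shift(n_items: int, points_per_item: int) -> float:
    """Index-point change when n_items each move by points_per_item raw points."""
    if not 0 <= points_per_item <= 3:
        # An item scored 1-4 can move by at most 3 raw points.
        raise ValueError("an item can shift by 0-3 raw points")
    return n_items * points_per_item * 100 / 80


print(index_shift(4, 2))  # 10.0
print(index_shift(4, 3))  # 15.0
```

A shift of that size is enough to move a patient across one of Zung's 10-point bands without any change in mood.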
SDS does not contain a direct suicidal-ideation item comparable to PHQ-9 (item 9) or BDI-II (item 9); its closest content is the thoughts-of-death item in the psychological dimension. This means SDS should not be used as the sole screen for suicide risk. If SDS is your primary monitoring instrument, add a dedicated suicide screen (Columbia-Suicide Severity Rating Scale, or at minimum a direct clinical question at every session).
SDS cutoffs are not universal: international validations span from ≥39 on the raw sum (about 49 on the index, in a Finnish geriatric population) to ≥55 on the index (in Brazilian PD patients). If you use SDS in a research protocol, always cite the cutoff from your specific validation source together with the scale it refers to, not the "standard" 50/60/70.
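Because published cutoffs are reported on different scales, one option is to store each cutoff together with its scale and normalise before comparing. The values below are the ones cited in this article; the dictionary keys and function names are hypothetical labels for this sketch.

```python
# Cutoffs as cited in the text, each stored with the scale it was reported on.
# The keys are hypothetical labels chosen for this sketch.
CUTOFFS = {
    "zung_standard": (50.0, "index"),
    "finnish_geriatric": (39.0, "raw"),
    "brazilian_parkinson": (55.0, "index"),
}


def cutoff_as_index(population: str) -> float:
    """Return the population's cutoff expressed on the 25-100 index scale."""
    value, scale = CUTOFFS[population]
    if scale == "raw":
        return value * 100 / 80
    if scale == "index":
        return value
    raise ValueError(f"unknown scale: {scale}")


def screens_positive_for(raw_sum: int, population: str) -> bool:
    """Convert the raw sum to the index and compare to the population cutoff."""
    return raw_sum * 100 / 80 >= cutoff_as_index(population)
```

With this layout the Finnish raw cutoff of 39 normalises to an index of 48.75, so it can sit next to the index-based cutoffs without silently mixing scales.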
Russian-language adaptation — a gap worth flagging honestly. In Russian clinical practice the T.I. Balashova adaptation (Bekhterev Research Psychoneurological Institute, Saint Petersburg) is widely used, sometimes under the title "Scale of Reduced Mood — Subdepression (SRM-S)". However, no peer-reviewed psychometric validation with ROC analysis, sensitivity/specificity, and factor structure on a Russian-speaking sample has been indexed in major international literature. This does not disqualify the instrument from clinical use — it does mean that research protocols intended for peer-review publication may require additional sample-specific validation.
"[Self-report depression scales] cannot reach a satisfactory level of diagnostic accuracy […] probably for the absence of an optimal procedure to select test items and subjects with clearly defined pathological symptoms."— Tommasi, Ferrara & Saggino, Frontiers in Psychology, 2018, Bayesian reanalysis of PHQ / BDI / CES-D / SDS
SDS in repeated measurement
The real strength of any valid depression scale lies not in a single measurement but in a sequence. SDS index 56 at session 1 says almost nothing: "above normal", but you do not know how often this person sits in that range, or which way they are moving. SDS index 56 → 48 → 42 → 38 across sessions 1, 4, 8, 12 — that is a story you can act on: is the patient responding to the current treatment plan, does the method or intensity of intervention need adjustment, should you intensify or step down the therapeutic focus.
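The example sequence above can be summarised numerically. This is a minimal sketch, not a validated change metric: it reports the step-to-step changes, the total change, and an ordinary least-squares slope in index points per session. The function name `trajectory_summary` is hypothetical.

```python
def trajectory_summary(sessions: list[int], indices: list[float]) -> dict:
    """Summarise a series of SDS index scores taken at the given session numbers."""
    if len(sessions) != len(indices) or len(sessions) < 2:
        raise ValueError("need at least two matching session/index pairs")
    n = len(sessions)
    mean_x = sum(sessions) / n
    mean_y = sum(indices) / n
    sxx = sum((x - mean_x) ** 2 for x in sessions)
    if sxx == 0:
        raise ValueError("session numbers must not all be identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(sessions, indices))
    return {
        # Change between consecutive measurements.
        "steps": [b - a for a, b in zip(indices, indices[1:])],
        # Last measurement minus first.
        "total_change": indices[-1] - indices[0],
        # Least-squares slope: index points per session.
        "slope_per_session": sxy / sxx,
    }


summary = trajectory_summary([1, 4, 8, 12], [56, 48, 42, 38])
print(summary["steps"])         # [-8, -6, -4]
print(summary["total_change"])  # -18
```

For this sequence the fitted slope is about -1.6 index points per session. The shrinking steps (-8, -6, -4) show that improvement continues but is decelerating, which a single before/after difference would hide.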
What counts as a clinically meaningful change. No peer-reviewed minimal clinically important difference (MCID) for SDS, derived through anchor-based or distribution-based methodology, exists in the literature. This contrasts with PHQ-9, where an MCID of approximately 5 points is well documented. In practice clinicians use a shift of around 10 index points as a working reference, but this is a convention, not a validated metric. The practical takeaway: watch the trajectory in clinical context rather than relying on a fixed "how many points equals improvement".
The MBC framework is simple: choose a valid instrument, agree on measurement frequency, follow the trajectory, and adjust the course when there is no movement or things get worse. SDS fits that loop as well as PHQ-9, especially if the patient already has historical SDS data from prior care.
The "forgotten" scale is just a scale with bad marketing. SDS holds its own against modern alternatives in most tasks and outperforms them in some: wording balance, long historical continuity, retrospective data compatibility. The right question is not "which scale is best" but "which scale answers my task".