A Methodological Framework for the Statistical Assessment of Random Number Generators and the Verification of Return-to-Player in Online Gaming Software

Abstract

I report on the methodology I have developed and applied for the technical assessment of random number generators (RNGs) used to determine outcomes in remote gambling products, together with the verification of theoretical and empirical return-to-player (RTP) values for slot and table games driven by those generators. The framework couples a white-box source-code review, two complementary statistical batteries (the NIST Statistical Test Suite based on Special Publication 800-22 Revision 1a, and the Dieharder battery descended from Marsaglia's Diehard tests), a Software Input Validation procedure that maintains traceable evidence that the testing tooling has not drifted from a recorded baseline, and a two-stage RTP audit combining exhaustive enumeration or analytical derivation with Monte Carlo simulation driven by the certified RNG. The framework is designed to be reproducible by independent reviewers and to satisfy the technical randomness and player-protection requirements adopted by European gaming regulators.

1. Introduction

The integrity of any RNG-driven gambling product rests on two distinct yet coupled assumptions. The first is that the underlying RNG produces sequences that are unpredictable, non-repeatable, uniformly distributed over the declared output range, and statistically independent across outputs. The second is that the deterministic mapping from raw RNG values to game outcomes, together with the published pay tables, returns to players, on average, the long-run percentage advertised by the operator. Failure of either assumption compromises both the fairness of the game and the regulatory compliance of the licensee.

I treat these two assumptions as separable evidentiary questions answered by distinct but interlocking procedures. RNG quality is established through a combination of source-code inspection and large-sample statistical testing of the operationally equivalent generation path. Return-to-player is established through independent mathematical derivation of the theoretical RTP, followed by Monte Carlo simulation of the game driven by the certified RNG, with the two values reconciled within a stated confidence interval.

This paper proceeds as follows. Section 2 presents the white-box source-code review and algorithmic classification activities. Section 3 describes the statistical testing methodology, the two-tailed confidence-interval interpretation, and the multisample majority-vote rule. Section 4 describes the Software Input Validation procedure that maintains the tooling baseline. Section 5 sets out the RTP verification methodology for both slot and table games, including sample-size determination and convergence analysis. Section 6 discusses limitations and directions for further work.

2. White-Box Source-Code Review and Algorithmic Classification

White-box review constitutes the first evidentiary stage. I obtain the full production-matching source code, all entry-point files, the dependency manifest, and the build artefacts, and I independently recompute SHA-1 checksums on every file before any further activity. The client confirms the recomputed checksums in writing against the production deployment; only at that point is the engagement scope formally frozen.
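
The checksum recomputation can be sketched as follows. This is a minimal illustration using Python's standard hashlib module; the helper names file_digest and manifest, the chunk size, and the choice to key the manifest by relative path are assumptions of the sketch rather than a description of the tooling actually used.

```python
import hashlib
from pathlib import Path


def file_digest(path, algorithm="sha1", chunk_size=1 << 20):
    """Return the hex digest of one file, reading in chunks to bound memory use."""
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def manifest(root, algorithm="sha1"):
    """Map every regular file under root (by relative path) to its digest."""
    root = Path(root)
    return {
        str(p.relative_to(root)): file_digest(p, algorithm)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }
```

Two manifests produced this way (one from the supplied source tree, one from the production deployment) can be compared with ordinary dictionary equality.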

2.1 Construction Identification

I identify the top-level RNG construction in use, classifying it as either a pseudorandom number generator (PRNG), a deterministic random bit generator (DRBG) in the sense of NIST SP 800-90A (Barker and Kelsey, 2015), or a composite construction in which an operating-system entropy primitive feeds an application-layer DRBG. Approved constructions include Hash_DRBG, HMAC_DRBG, and CTR_DRBG, as well as language-provided cryptographically secure RNGs that wrap the kernel CSPRNG (for example java.security.SecureRandom, crypto.randomBytes, getrandom(2), BCryptGenRandom, random_bytes, and secrets). Non-approved constructions, including linear congruential generators, the Mersenne Twister MT19937 used for security-relevant output, the withdrawn Dual_EC_DRBG, and any custom hash- or stream-cipher based construction without a documented security argument, are flagged as findings.

For every identified construction I perform and record a documented search of the National Vulnerability Database and the upstream advisory channel for the specific library version in use. I further record the publicly documented period of the generator and confirm that the declared operating lifetime cannot exhaust that period; for DRBGs the reseed interval is confirmed to lie within the bound documented for the selected construction (Barker and Kelsey, 2015).

2.2 Entropy Sources, Seeding, and Reseeding

I trace the ultimate entropy source (typically a kernel CSPRNG such as the Linux /dev/urandom ChaCha20 construction, or BCryptGenRandom on Windows) and confirm it is configured as declared. I reject seed material drawn from process identifiers, time-based primitives such as time(NULL) or System.currentTimeMillis(), MAC address or hostname, hard-coded constants, or user-controllable input. I confirm the reseed source is the declared entropy source and not a cached value, and I examine fail-secure behaviour under entropy starvation in virtualised hosts, where transient depletion of the kernel entropy pool has been documented (Heninger et al., 2012).

2.3 Scaling, Mapping, and Modulo Bias

The mapping from raw RNG output to a bounded outcome space is the second-most common defect site after weak seeding. At every mapping site I reject the naïve construction raw % N where N does not exactly divide the size R of the raw output range, since this introduces the classical modulo bias: the lowest R mod N residues each occur with probability exactly 1/R higher than the remaining residues. I accept rejection sampling, Lemire's multiplication-and-shift construction (Lemire, 2019), and language-provided bounded-integer APIs that are documented as unbiased for the runtime in use. Floating-point scaling (for example (int)(rand() * max)) is rejected even when the underlying RNG is cryptographically secure, because the finite set of representable floating-point values cannot be divided evenly among the target outcomes and IEEE-754 rounding of the product can return the excluded upper endpoint.
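
As an illustration only, the rejection-sampling option and the bias it avoids can be sketched in Python. The function names unbiased_below and modulo_bias_counts, the 32-bit default width, and the use of secrets.randbits as the raw source are assumptions of the sketch, not features of any assessed product.

```python
import secrets


def unbiased_below(n, bits=32, draw=secrets.randbits):
    """Return an integer uniform on [0, n) by rejection sampling.

    draw(bits) must return a uniform integer on [0, 2**bits).
    """
    if not 0 < n <= (1 << bits):
        raise ValueError("n must satisfy 0 < n <= 2**bits")
    span = 1 << bits
    limit = span - (span % n)  # largest multiple of n inside the raw range
    while True:
        raw = draw(bits)
        if raw < limit:        # discard the incomplete final block
            return raw % n


def modulo_bias_counts(n, bits):
    """Count how often each residue appears when raw % n is applied to
    every raw value in [0, 2**bits); only practical for small bit widths."""
    counts = [0] * n
    for raw in range(1 << bits):
        counts[raw % n] += 1
    return counts
```

For a 4-bit raw range and N = 6, modulo_bias_counts(6, 4) returns [3, 3, 3, 3, 2, 2]: the four lowest residues (16 mod 6 = 4) each occur once more than the other two, which is the defect the rejection step removes.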

For weighted lookups (reel strips, virtual reels, symbol-weight tables) I verify that the cumulative-weight lookup is implemented correctly, that the weights are loaded from a configuration source that is immutable from any user-reachable path, and that the configuration was frozen at the scope-freeze step.
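
A correct cumulative-weight lookup can be sketched as below. The helper names and the toy weights are hypothetical; the sketch assumes positive integer weights and an index r that has already been drawn uniformly on [0, total weight) by an unbiased bounded-integer routine.

```python
import bisect
from itertools import accumulate


def build_cumulative(weights):
    """Return the running totals of a list of positive integer weights."""
    if not weights or any(w <= 0 for w in weights):
        raise ValueError("weights must be a non-empty list of positive integers")
    return list(accumulate(weights))


def lookup(cumulative, r):
    """Map r in [0, total) to the index of the symbol whose weight band contains r."""
    if not 0 <= r < cumulative[-1]:
        raise ValueError("r is outside [0, total weight)")
    # bisect_right returns the first index whose running total exceeds r
    return bisect.bisect_right(cumulative, r)
```

With weights [3, 1, 2], the six possible values of r map to indices [0, 0, 0, 1, 2, 2], so each symbol is selected exactly in proportion to its weight. The usual off-by-one defects (using bisect_left, or drawing r on a closed interval) show up immediately in this kind of exhaustive check.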

2.4 Shuffling and Sampling

For card and tile shuffles I require a correctly implemented Fisher-Yates algorithm (Knuth, 1998), in-place, with the per-iteration index drawn from a half-open interval using an unbiased bounded-integer routine. I reject the "sort by random key" shuffle, which does not produce a uniform permutation distribution when keys can collide or when a random comparator is passed to the sort routine, and I verify that for sampling without replacement the consumed item is removed from the candidate pool by a documented mechanism rather than relying on incidental data structures.
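
For reference, an in-place Fisher-Yates shuffle of the kind I look for can be sketched as follows. The function name and the injectable randbelow parameter are choices made for the illustration; secrets.randbelow stands in for whatever unbiased bounded-integer routine the product under review provides.

```python
import secrets


def fisher_yates(items, randbelow=secrets.randbelow):
    """Shuffle items in place and return the same list.

    randbelow(k) must return an integer uniform on the half-open range [0, k).
    """
    for i in range(len(items) - 1, 0, -1):
        j = randbelow(i + 1)  # j may equal i, so an element can stay in place
        items[i], items[j] = items[j], items[i]
    return items
```

The common defect is to draw j from [0, len(items)) on every iteration instead of [0, i + 1); that variant has n^n equally likely execution paths, which cannot be divided evenly among the n! permutations for n > 2.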

2.5 Injection and Disclosure Surface

I enumerate every code path that can set, reset, influence, or disclose the RNG seed or internal state, including administrative endpoints, environment variables, configuration-file overrides, debug or feature-flagged modes, and any externally reachable diagnostic channel. I verify that each such path is absent from the production build, excluded at compile time, or gated by an authenticated and audited control. I confirm that raw RNG output and internal state are not written to log files, error responses, crash dumps, or APM spans, since disclosure of enough consecutive raw outputs suffices to recover the full internal state of weak constructions (for example, 624 consecutive 32-bit outputs uniquely determine the state of MT19937 (Matsumoto and Nishimura, 1998)).

3. Statistical Assessment of Raw RNG Output

Once the source-code review concludes without unresolved critical findings, I subject the operationally equivalent generation path to two statistical batteries operating in parallel on three independently captured binary files.

3.1 NIST SP 800-22 Battery

I execute the fifteen-test battery defined by NIST SP 800-22 Revision 1a (Rukhin et al., 2010) using the official assess binary distributed by the NIST Computer Security Resource Center. The chosen operating point is n = 1,000,000 bits per sequence and m = 1,000 bitstreams per file, which corresponds to the canonical operating point identified in §4.3 of the standard, where

n is the sequence length in bits,
m is the number of independent bitstreams per file.

For each test the battery reports two derived statistics: the proportion of sequences whose per-sequence p-value satisfies p ≥ α, and a chi-square uniformity p-value computed from the histogram of the per-sequence p-values across ten equal-width bins. A file is judged to pass a given test at confidence level (1 − α) only if the proportion lies inside the acceptance interval derived from the normal approximation to the binomial distribution and the uniformity p-value lies inside its own acceptance interval.
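
The proportion criterion can be sketched as follows. The sketch assumes the interval form given in SP 800-22 Revision 1a, namely (1 − α) ± k·√(α(1 − α)/m) with k = 3; the function names are hypothetical, and the multiplier is exposed as a parameter because the engagement may fix a different value.

```python
import math


def proportion_interval(alpha, m, k=3.0):
    """Acceptance interval for the fraction of sequences with p >= alpha."""
    p_hat = 1.0 - alpha
    half_width = k * math.sqrt(p_hat * alpha / m)
    return p_hat - half_width, min(1.0, p_hat + half_width)


def proportion_passes(p_values, alpha, k=3.0):
    """True when the observed pass fraction lies inside the acceptance interval."""
    m = len(p_values)
    passed = sum(1 for p in p_values if p >= alpha)
    low, high = proportion_interval(alpha, m, k)
    return low <= passed / m <= high
```

At α = 0.01 and m = 1,000 the lower bound is approximately 0.98056, so a test needs at least 981 of the 1,000 sequences to pass individually.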

For the five tests that emit more than one p-value per invocation (Cumulative Sums, Serial, Non-overlapping Templates, Random Excursions, and Random Excursions Variant), I treat the sub-results differently according to the cardinality K of the sub-result set. For small K (Cumulative Sums and Serial, where K = 2) every sub-result must pass. For large K (Non-overlapping Templates with K = 148, Random Excursions with K ≤ 8, Random Excursions Variant with K ≤ 18) I apply a Bernoulli sub-proportion rule derived from the same confidence formula, since requiring all sub-results to pass would inflate the family-wise Type I error beyond the declared significance level. Where the data does not support a state-conditional sub-test (a common occurrence for Random Excursions on short or near-zero-mean captures) the affected row is excluded from both the numerator and denominator of the sub-proportion test rather than treated as a failure.

3.2 Dieharder Battery

I execute fifteen tests from the Dieharder distribution (Brown et al., 2018), the open-source descendant of Marsaglia's Diehard battery (Marsaglia, 1995). Each test is invoked at three values of the -p (psamples) parameter, namely 100, 500, and 1000, which together implement the escalation hierarchy described by Brown for generators that produce ambiguous results at lower precision. I do not escalate beyond -p 1000: at higher precision the test consumes input on the order of hundreds of gigabytes and pushes even reference cryptographic generators into nominal failure, with no useful discrimination remaining.

Each independent capture file is at least 10 GiB, sized to prevent the silent file-rewind behaviour that Dieharder exhibits at the highest psamples tier on the heaviest tests. The same capture files are reused across the NIST and Dieharder batteries, since the 10 GiB Dieharder requirement subsumes the 125 MB per-file minimum established by the NIST operating point.

3.3 Two-Tailed Interpretation

I evaluate every p-value bilaterally at both the 95% and 99% confidence levels, treating both tails as failure zones. A p-value close to unity is as statistically unusual under the null hypothesis as one close to zero, and I therefore apply the symmetric acceptance intervals:

0.025 ≤ p ≤ 0.975 at the 95% level

0.005 ≤ p ≤ 0.995 at the 99% level

This is consistent with Brown's explicit guidance for Dieharder interpretation and removes a documented source of operator bias in which one-tailed thresholds permit a generator to pass with anomalously high p-values that are themselves diagnostic of weak entropy or pathological internal state.
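
The bilateral rule reduces to a simple interval check. In the sketch below the function name and the dictionary keyed by confidence level are illustrative; the bounds are stored as literals rather than computed as (1 − confidence)/2 so that boundary values such as p = 0.025 are not misclassified by floating-point rounding.

```python
# Acceptance intervals from Section 3.3, keyed by confidence level in percent.
TWO_TAILED_BOUNDS = {
    95: (0.025, 0.975),
    99: (0.005, 0.995),
}


def two_tailed_pass(p_value, level):
    """True when p_value lies inside the symmetric acceptance interval."""
    low, high = TWO_TAILED_BOUNDS[level]
    return low <= p_value <= high
```

A p-value of 0.98, for example, fails at the 95% level and passes at the 99% level, whereas a one-tailed threshold would accept it at both.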

3.4 Multisample Majority-Vote Rule

A single capture file evaluated at a single tier produces a result that is subject to two distinct sources of noise: the intrinsic statistical variation of a passing generator at the declared significance level, and transient artefacts of the capture environment. To absorb both, I apply a two-stage majority vote over a nine-cell grid of three independent capture files times three psamples tiers (or, for the NIST battery, three capture files at the fixed canonical operating point).

In the first (inner) stage, for each file and each test, I count how many psamples tiers pass at the declared CI level. The per-file verdict is PASS when at least two of the three tiers pass. In the second (outer) stage I count how many files pass the inner vote; the test passes overall when at least two of the three files pass. The engagement-level verdict combines the per-test outcomes at both the 95% and 99% confidence levels: a confirmed failure at either level constitutes engagement failure and triggers a recapture, on the condition that no parameter is tuned to chase a different outcome. The two-stage construction is statistically robust against incidental noise while remaining sensitive to a persistent defect across either dimension.
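
The two-stage vote for a single test at a single CI level can be sketched as follows. The function names and the nested-list layout (one row per capture file, one boolean per psamples tier) are assumptions made for the illustration.

```python
def majority(flags, needed=2):
    """True when at least `needed` of the flags are true."""
    return sum(bool(f) for f in flags) >= needed


def overall_verdict(grid):
    """Apply the inner vote across tiers, then the outer vote across files.

    grid[file_index][tier_index] is True when that cell passed at the
    declared CI level.
    """
    per_file = [majority(tiers) for tiers in grid]
    return majority(per_file)
```

For the NIST battery, which has no tier dimension, the inner stage is degenerate and majority can be applied directly to the three per-file results.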

4. Software Input Validation of the Testing Tooling

The reproducibility argument of the methodology depends on the assertion that the testing tooling has not drifted between the date the baseline was established and the date of any subsequent engagement. I address this by performing periodic Software Input Validation against fixed reference datasets.

4.1 Reference Datasets

I maintain two reference files for the Dieharder validation: one generated from Dieharder's built-in AES_OFB generator (the "good" dataset) and one generated from the historical IBM RANDU linear congruential generator (the "bad" dataset). The "good" dataset is expected to pass every test within statistical tolerance; the "bad" dataset is expected to fail decisively on the binary-rank, OPSO, OQSO, DNA, and parking-lot tests, in agreement with the well-documented failure modes of RANDU (Knuth, 1998; Marsaglia, 1995). Both files are exactly 10 GiB and are identified by SHA-1 hash recorded at the time the baseline was established.

For the NIST validation I use the data.pi reference file distributed with the official sts-2.1.2 release, comprising 1,000,000 ASCII-encoded bits drawn from the binary expansion of π. The expected p-values for data.pi are tabulated in Appendix B of the standard (Rukhin et al., 2010) and serve as the externally fixed baseline against which the local installation is compared.

4.2 Baseline and Tolerance

For Dieharder the baseline is local: I record every p-value returned at the time of baseline establishment and require every subsequent run to reproduce those values to at least six decimal places. The validation succeeds only if every one of the thirty test invocations (fifteen tests over two reference files) matches the baseline; there is no intermediate outcome. For NIST the baseline is external, fixed by NIST's publication; tolerance is ±0.001 on each of the sixteen Appendix B p-values, matching the precision at which NIST itself publishes the reference values. Any deviation outside the documented tolerance is measurable evidence that the installation no longer behaves as the NIST reference implementation, and the affected tool is taken out of service pending root-cause analysis.

5. Return-to-Player Verification

RTP verification proceeds in three coupled stages: an independent derivation of the theoretical RTP from first principles, a Monte Carlo simulation driven by the certified RNG, and a reconciliation of the two values within a stated confidence interval. The same engagement also produces an independently measured estimate of game volatility, which is used both as the standard deviation in the confidence-interval calculation and as input to live monitoring tolerance bands.

5.1 Theoretical RTP for Slot Games

For slot games with fixed reel strips and a tractable state space I derive the theoretical RTP by exhaustive enumeration. The reel-strip symbol counts are extracted from the source code and cross-validated against the developer's Probability Accounting Report (PAR sheet) in a per-reel symbol-count matrix; any discrepancy of one or more symbols on any reel is treated as a major non-conformity. Enumeration software then iterates over every possible combination of reel stops, evaluates the pay table on each outcome, weights each payout by its joint probability of occurrence, and sums across the state space to produce the base-game contribution. Each feature game state (free spins, respins, bonus rounds) is enumerated independently and weighted by its trigger probability; progressive jackpot contributions are calculated from the declared increment rate and the expected hit interval.
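
The base-game enumeration can be illustrated on a toy game. Everything specific in the sketch below (the three four-stop reel strips, the three-of-a-kind pay rule, the function names) is hypothetical, and the sketch assumes that every stop on a reel is equally likely; exact rational arithmetic is used so that the result carries no floating-point error.

```python
from fractions import Fraction
from itertools import product


def enumerate_rtp(reels, pay):
    """Average payout in bet units over every combination of reel stops."""
    total = Fraction(0)
    count = 0
    for combo in product(*reels):
        total += Fraction(pay(combo))
        count += 1
    return total / count


# Hypothetical toy game: three identical reels, pays only on three of a kind.
TOY_REELS = [["A", "B", "B", "C"]] * 3
TOY_PAYS = {"A": 20, "B": 4, "C": 8}


def toy_pay(combo):
    """Payout in bet units for one combination of visible symbols."""
    first = combo[0]
    return TOY_PAYS[first] if all(s == first for s in combo) else 0
```

Of the 64 stop combinations, one pays 20, eight pay 4, and one pays 8, so enumerate_rtp(TOY_REELS, toy_pay) evaluates to 60/64 = 15/16, or 93.75%.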

For game families whose state space is computationally impractical to enumerate (cascading-symbol mechanics, Megaways constructions with variable reel heights, and multi-level progressive bonuses), I substitute Monte Carlo simulation as the primary method and retain enumeration for any tractable sub-state.

5.2 Theoretical RTP for Table Games

For RNG-driven table games the probability space is known analytically. For a single wager type I evaluate

RTP_bet = Pr(win) × (payout odds + 1)

where

Pr(win) is the probability that the wager pays out, derived from the combinatorial structure of the game (pocket distribution for roulette, deck composition for baccarat, dice probabilities for craps),
payout odds is the declared payout ratio expressed in units of the wager (for instance, 35 for a European straight-up roulette bet, giving 36 / 37 ≈ 97.30%).

I tabulate this expression over every bet type in the menu. For baccarat I evaluate the banker, player, and tie bets using combinatorial probability across the declared number of decks, applying the declared commission rate to the banker bet and treating each side bet as an independent certification item. For blackjack I derive RTP under a complete Basic Strategy Decision Matrix consistent with the declared rule variant (number of decks, dealer behaviour on soft 17, double-after-split availability, surrender option); the strategy assumption is recorded as part of the certificate.
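
As a worked instance of the per-bet expression, the following sketch tabulates the standard single-zero roulette menu. The dictionary layout and function name are illustrative, and the sketch ignores optional rules such as la partage or en prison, which change the even-money figures where they are offered.

```python
from fractions import Fraction


def bet_rtp(p_win, payout_odds):
    """RTP of a single wager: Pr(win) x (payout odds + 1)."""
    return p_win * (payout_odds + 1)


# Bet name -> (number of covered pockets, payout odds) on a 37-pocket wheel.
EUROPEAN_ROULETTE = {
    "straight-up": (1, 35),
    "split": (2, 17),
    "street": (3, 11),
    "corner": (4, 8),
    "six-line": (6, 5),
    "dozen or column": (12, 2),
    "even-money": (18, 1),
}

ROULETTE_RTP = {
    name: bet_rtp(Fraction(pockets, 37), odds)
    for name, (pockets, odds) in EUROPEAN_ROULETTE.items()
}
```

Every entry evaluates to 36/37, which is the expected result for this menu and a useful sanity check: any bet type whose tabulated value differs indicates either a non-standard payout or a transcription error in the pay table.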

5.3 Monte Carlo Simulation and Sample-Size Determination

The Monte Carlo leg of the audit is driven exclusively by the certified RNG in its production-equivalent configuration; substitution by any other random source invalidates the run. The minimum number of rounds per RTP version is determined by the desired confidence-interval half-width via

N_min = (1.96 × σ / ε)²

where

N_min is the minimum number of simulated rounds required,
σ is the game volatility expressed in bet units (the standard deviation of per-round payout divided by the bet unit),
ε is the target half-width of the 95% confidence interval, with a default value of 0.005 (that is, ±0.5 percentage points),
the factor 1.96 is the z-score for a two-tailed 95% confidence level.

I enforce absolute floors on the computed N_min that vary with the declared volatility band:

Volatility band	Range of σ (bet units)	Minimum simulated rounds
Low	σ < 5.0	10,000,000
Medium	5.0 ≤ σ < 10.0	50,000,000
High	10.0 ≤ σ < 20.0	100,000,000
Extreme	σ ≥ 20.0	250,000,000

For games with rare bonus triggers (trigger probability less than 1 in 500) the floor is increased to the larger of the volatility-band requirement and 1000 divided by the trigger probability, ensuring that the bonus pathway is itself adequately sampled. Table-game cross-validation requires at least 10,000,000 rounds per bet type.
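
The sample-size rule for slot configurations, including the volatility-band floors and the rare-trigger adjustment, can be sketched as below. The function and constant names are hypothetical, and the sketch takes the final round count to be the largest of the formula value, the band floor, and (where applicable) 1000 divided by the trigger probability.

```python
import math

# (exclusive upper bound on sigma, minimum simulated rounds)
BAND_FLOORS = [
    (5.0, 10_000_000),        # Low
    (10.0, 50_000_000),       # Medium
    (20.0, 100_000_000),      # High
    (math.inf, 250_000_000),  # Extreme
]


def min_rounds(sigma, epsilon=0.005, z=1.96, trigger_probability=None):
    """Minimum simulated rounds for one RTP configuration."""
    if sigma <= 0 or epsilon <= 0:
        raise ValueError("sigma and epsilon must be positive")
    n_formula = math.ceil((z * sigma / epsilon) ** 2)
    floor = next(rounds for upper, rounds in BAND_FLOORS if sigma < upper)
    n = max(n_formula, floor)
    # Rare-bonus adjustment applies only below a 1-in-500 trigger probability.
    if trigger_probability is not None and trigger_probability < 1 / 500:
        n = max(n, math.ceil(1000 / trigger_probability))
    return n
```

With the default ε the formula value exceeds the Extreme floor only for σ above roughly 40, so for most games the band floor is the binding constraint.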

For games offering multiple selectable RTP configurations on a single binary, each configuration is validated independently against the volatility-band floor; shared-binary testing of one configuration does not certify the others.

5.4 Convergence Analysis and Acceptance Criterion

At the conclusion of each simulation run I plot the running average RTP against the number of completed rounds and confirm visible convergence to the theoretical value. I compute the 95% confidence interval as

CI = ±1.96 × σ / √n

where

σ is the independently measured game volatility (see §5.5),
n is the realised round count of the simulation run,
the factor 1.96 corresponds to the two-tailed 95% confidence level; the analogous z-score for the 99% level is 2.576.

I do not accept reconciliation at any confidence level above 99% (z = 2.576) for any tier, since the wider tolerance this implies is capable of masking defective game behaviour. The acceptance criterion is satisfied when the simulated RTP falls within the 95% confidence interval of the theoretical value; the same criterion is applied independently to each major component (base game, each feature, each jackpot tier, each bet type for table games), and the per-component reconciliation table forms a mandatory exhibit of the engagement report.
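
The reconciliation check itself is a one-line comparison, sketched here with hypothetical function names. RTP values are expressed as fractions of the bet (0.96 for 96%) so that they share units with σ.

```python
import math


def ci_half_width(sigma, n, z=1.96):
    """Half-width of the confidence interval on the mean payout per round."""
    return z * sigma / math.sqrt(n)


def rtp_reconciles(rtp_simulated, rtp_theoretical, sigma, n, z=1.96):
    """True when the simulated RTP lies within the interval around theory."""
    return abs(rtp_simulated - rtp_theoretical) <= ci_half_width(sigma, n, z)
```

For σ = 8 and n = 50,000,000 the half-width is about 0.0022, so a simulated RTP of 96.12% reconciles with a theoretical 96.00% while a simulated 96.50% does not.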

A separate regulatory floor applies independent of the statistical reconciliation: each declared theoretical RTP configuration must equal or exceed 85%, the minimum threshold adopted under Article 22 of the relevant player-protection directive. Any configuration falling below this floor is flagged as a regulatory non-conformity irrespective of the statistical outcome.

5.5 Independent Volatility Verification

I derive σ from the enumerated probability distribution and from the simulated payout series, expressing the result in bet units (the standard deviation of per-round payout divided by the bet unit). The two estimates are compared against the client's declared volatility under the relative-tolerance criterion

| σ_measured − σ_declared | / σ_declared ≤ 0.05

where

σ_measured is my independent estimate from enumeration or simulation,
σ_declared is the volatility figure supplied by the client and printed on the certificate.

The verified σ is recorded on the certificate and is used as input to all live monitoring tolerance calculations performed by the operator after deployment.
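
As a small worked example, σ can be computed from an enumerated payout distribution and compared with a declared figure as follows. The function names are hypothetical, and the distribution is passed as (payout in bet units, probability) pairs.

```python
import math


def sigma_from_distribution(dist):
    """Standard deviation of per-round payout, in bet units."""
    mean = sum(x * p for x, p in dist)
    variance = sum(p * (x - mean) ** 2 for x, p in dist)
    return math.sqrt(variance)


def volatility_within_tolerance(measured, declared, tolerance=0.05):
    """Relative-tolerance criterion on the declared volatility."""
    return abs(measured - declared) / declared <= tolerance


# Straight-up bet on a 37-pocket wheel: returns 36 units with probability 1/37.
STRAIGHT_UP = [(36, 1 / 37), (0, 36 / 37)]
```

For the straight-up roulette bet this gives σ ≈ 5.84 bet units; a declared value of 5.8 would be accepted under the 5% criterion and a declared value of 5.0 would not.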

5.6 Chi-Square Testing of Dealing, Wheel, and Dice Algorithms

For table games I supplement the RTP audit with a chi-square goodness-of-fit test on the empirical distribution of dealing, wheel, or dice outcomes. The test is executed over at least 10,000,000 simulated trials, applying

χ² = ∑ᵢ (Oᵢ − Eᵢ)² / Eᵢ

where

Oᵢ is the observed frequency of outcome i,
Eᵢ is the expected frequency of outcome i under the declared shuffle, wheel, or dice algorithm,
the summation is taken over all outcome categories i = 1, 2, …, k,
the resulting statistic is referred to the χ² distribution with k − 1 degrees of freedom for a one-way classification, or (r − 1)(c − 1) degrees of freedom for an r-by-c contingency table.

For a single die the test verifies uniformity across the six faces; for a 37-pocket European roulette wheel the test verifies uniformity across all pockets; for card games the test verifies that the empirical frequency of each of the 52d cards in a d-deck shoe matches its expected frequency under the declared shuffle algorithm. Inter-deal independence is checked separately by confirming that the remaining shoe composition is correctly updated after every draw, since dependence between successive deals can mask uniformity defects in the per-deal distribution.
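
The uniformity case of the test can be sketched as below. The function names are hypothetical, and the sketch compares the statistic with a caller-supplied upper critical value instead of computing a p-value, since the latter needs a chi-square distribution function from a statistics library; the tabulated 95% critical value for 5 degrees of freedom is 11.070.

```python
def chi_square_statistic(observed, expected):
    """Pearson chi-square statistic over matching outcome categories."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def uniform_gof(observed, critical_value):
    """Test observed counts against a uniform expectation.

    Returns (statistic, passed), where passed means the statistic does not
    exceed the supplied critical value for k - 1 degrees of freedom.
    """
    n = sum(observed)
    k = len(observed)
    expected = [n / k] * k
    statistic = chi_square_statistic(observed, expected)
    return statistic, statistic <= critical_value
```

For die counts [100, 98, 102, 101, 99, 100] the statistic is 0.10 and the check passes; for [150, 90, 90, 90, 90, 90] it is 30.0 and the check fails. Applying the two-tailed interpretation of Section 3.3 would additionally require a lower critical value, which this sketch omits.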

6. Discussion and Limitations

The methodology I have set out is designed to satisfy the dual obligations of regulatory compliance and scientific reproducibility. Three properties are central to that design.

First, the source-code review and the statistical batteries are evidentiary on separable properties of the same generator. A statistically passing capture does not, on its own, demonstrate that the production pathway is identical to the captured pathway; this is established by the source-code review and by the cryptographic checksums recorded at the scope-freeze step. Conversely, a clean source-code review does not, on its own, demonstrate freedom from subtle entropy or scaling defects detectable only at large sample size; this is established by the statistical batteries.

Second, the multisample majority-vote rule explicitly addresses the family-wise error inflation that arises from running a large battery on a single capture. The combined verdict over three files at two CI levels is robust against incidental statistical noise yet remains sensitive to a defect that persists across independent captures.

Third, the RTP reconciliation embeds a non-tightenable tolerance: the half-width of the 95% confidence interval is proportional to σ and inversely proportional to √n, and the round-count floor is itself a function of σ. The construction prevents a high-volatility game from being certified on a low-volatility-sized sample, which would otherwise produce a deceptively narrow tolerance and a correspondingly weak guarantee against systematic defect.

The methodology has three principal limitations that I note for completeness. The first is that any white-box review depends on the supplied source code being a faithful representation of the production deployment; I mitigate this by independent recomputation of cryptographic checksums, by an operational-equivalence statement signed by the client, and by build-from-source verification where licence-restricted toolchains permit. The second is that statistical batteries are computationally bounded: they detect deviations at the precision afforded by the operating point and at no greater precision. Generators that defeat the battery at the chosen operating point cannot be certified by the battery itself, although the source-code review provides an independent path to detection. The third is that RTP reconciliation rests on the assumption that the certified RNG is the only source of randomness in the simulated game; substitution by an unverified random source during simulation, whether deliberate or accidental, invalidates the entire RTP leg of the audit. The reproducibility evidence captured by the simulation harness is designed to make such substitution detectable after the fact.

Future work will examine extension of the multisample voting framework to ensemble-of-batteries constructions (BoostTest; TestU01, L'Ecuyer and Simard, 2007) and a re-derivation of the volatility tolerance criterion under heavy-tailed payout distributions, where the central-limit assumptions underlying the 5% relative-tolerance bound may admit refinement.

References

Barker, E. and Kelsey, J. (2015). Recommendation for Random Number Generation Using Deterministic Random Bit Generators. NIST Special Publication 800-90A Revision 1, National Institute of Standards and Technology, Gaithersburg, MD.

Brown, R. G., Eddelbuettel, D. and Bauer, D. (2018). Dieharder: A Random Number Test Suite, version 3.31.1. Open-source distribution and documentation, Duke University.

Heninger, N., Durumeric, Z., Wustrow, E. and Halderman, J. A. (2012). Mining your Ps and Qs: detection of widespread weak keys in network devices. In Proceedings of the 21st USENIX Security Symposium, pages 205–220.

Knuth, D. E. (1998). The Art of Computer Programming, Volume 2: Seminumerical Algorithms, 3rd edition. Addison-Wesley, Reading, MA. Sections 3.2 (Generating Uniform Random Numbers) and 3.4.2 (Random Sampling and Shuffling).

L'Ecuyer, P. and Simard, R. (2007). TestU01: a C library for empirical testing of random number generators. ACM Transactions on Mathematical Software, 33(4), Article 22.

Lemire, D. (2019). Fast random integer generation in an interval. ACM Transactions on Modeling and Computer Simulation, 29(1), Article 3.

Marsaglia, G. (1995). DIEHARD: A Battery of Tests of Randomness. Florida State University, Tallahassee, FL.

Matsumoto, M. and Nishimura, T. (1998). Mersenne Twister: a 623-dimensionally equidistributed uniform pseudo-random number generator. ACM Transactions on Modeling and Computer Simulation, 8(1), 3–30.

Rukhin, A., Soto, J., Nechvatal, J., Smid, M., Barker, E., Leigh, S., Levenson, M., Vangel, M., Banks, D., Heckert, A., Dray, J. and Vo, S. (2010). A Statistical Test Suite for Random and Pseudorandom Number Generators for Cryptographic Applications. NIST Special Publication 800-22 Revision 1a, National Institute of Standards and Technology, Gaithersburg, MD.

Schindler, W. and Killmann, W. (2002). Evaluation criteria for true (physical) random number generators used in cryptographic applications. In Cryptographic Hardware and Embedded Systems (CHES 2002), Lecture Notes in Computer Science, vol. 2523, pages 431–449, Springer.

Turan, M. S., Barker, E., Kelsey, J., McKay, K. A., Baish, M. L. and Boyle, M. (2018). Recommendation for the Entropy Sources Used for Random Bit Generation. NIST Special Publication 800-90B, National Institute of Standards and Technology, Gaithersburg, MD.