Randomness on Repeat – Ocean of Math

I recently upgraded from an old phone with a 3.5mm headphone jack to one that’s too new and fancy for such antiquated features. This created a nuisance because I had only wired 3.5mm headphones available, so instead of using my phone as an MP3 player I resorted to a dedicated MP3 player from the 2010s for listening to music at work. When I had last used this device, I was a plebeian Windows user so I could use the player’s sync capabilities with Windows Media Player to create playlists. But now, being a seasoned professional, I use Linux which doesn’t support the playlist syncing. The best I could do was to put the whole music collection on the player, hit “shuffle”, and hope for the best.

After a few days the player seemed to develop a preference for Christmas songs and for one particular song that I hadn’t even thought about in years. “Seemed” is the operative word here, as a truly random sequence of songs would occasionally exhibit such behavior and my brain was probably just instinctively looking for structure in the randomness. Back in the early years of the iPod, the mismatch between actual (well, pseudo-) randomness and listeners’ expectations of how randomness ought to behave got so many people convinced that their iPods had favorite songs that Apple adjusted the algorithm to be less random, so that people would think it was more random.

The question for this post is: assuming true randomness, how long should one expect to have to wait to get a repeated song?

Working It Out

There are two main patterns that seem non-random even though they are. First, repeating a single song that was played “recently”; and second, playing a song from the same album as a previous recent song. To address both questions at once, we will use $p$ to denote the probability that any two randomly selected songs “match”, where “match” can mean “are the same song” or “are from the same album” depending on what you’re interested in.

Our random variable $X$ is the number of played songs before a “match” is encountered. The typical way of writing the expected value $E[X]$ would be:

$$\displaystyle E[X] = \sum_{R=1}^\infty RP(X = R) $$

which uses the probability density function $P$. It will be easier, however, to use an alternate formulation using the survival function:

$$\displaystyle E[X] = \sum_{R=0}^\infty P(X > R) $$

This form can be seen to be equivalent to the usual form by viewing $RP(X = R)$ as $R$ copies of $P(x=R)$, then rearranging the summands. The continuous analog is just an application of Fubini’s theorem:

$$\displaystyle \begin{aligned} \int_0^\infty xP(X=x)\,dx &= \int_0^\infty \int_0^x P(X=x)\,dy \,dx \\ &= \int_0^\infty \int_y^\infty P(X=y) \,dx\,dy \\ &= \int_0^\infty P(X>y)\,dy \end{aligned} $$

The survival function $P(X > R)$ is simply the probability that the first $R$ songs are all distinct. We’ll assume that $R$ is small compared to the total number of songs, so that the pairs among those songs are approximately independent (this is the first of many unsatisfying approximations we’ll have to make in order to get a tractable estimate in the end). With $R$ songs, there are $\frac{R(R-1)}{2}$ pairs, so the probability that they are all different is:

$$\displaystyle P(X > R) \approx (1-p)^{R(R-1)/2} $$

We will make another unsatisfying approximation that $1-p \approx e^{-p}$ which is reasonable for sufficiently small $p$:

$$\displaystyle P(X > R) \approx e^{-pR(R-1)/2}\approx e^{-pR^2/2} $$

incurring another unsightly approximation. Then we will approximate the discrete sum by a continuous integral:

$$\displaystyle E[X] = \sum_{R=0}^\infty P(X>R) \approx \int_0^\infty e^{-pR^2/2} \,dR = \sqrt{\frac{\pi}{2p}} $$

If “match” means the same individual song, then $p = \frac{1}{N}$ where $N$ is the total number of songs. In my case, $N = 776$ so $E[X] \approx 34.9$ is about how many songs to play before expecting a repeat. Considerably smaller than $388 = N/2$ one might have expected before thinking about the problem! Ultimately, this is because of the quadratic $R^2$ term in the exponential.

If we instead consider songs to match if they come from the same album, we need to do some extra work to compute $p$. If album $i$ has $a_i$ songs in it, still with $N = \sum_i a_i$ songs in total, then the probability that two randomly chosen songs are from the same album is:

$$\displaystyle p = p_A = \sum_i \left( \frac{a_i}{N} \right)^2 $$

We get back the $p = \frac{1}{N}$ for a single song by pretending that every song is its own album. Interestingly, it is always the case that $p_A \ge p$, so same-album repeats are more likely than same-song repeats which is what you would expect.

I really don’t like all the approximations we had to do to get here. But without them it would be tough to get a compact, closed-form answer at the end so it’s worth it.