One of my little side projects is something I call my "Problem" database – it's a couple of gigabytes of random data that I can use to quantify the performance impact of natural keys vs. surrogate keys for date dimensions. Being me, the question of how to efficiently generate data – in this case, 10 million rows of date fields with uneven distributions, with one date's value constrained by the other's – became a problem in its own right.
I began with a table I already had in Problem called, suitably, LargeTable. It has two columns, KeyVal (1-10,000,000) and DataVal (randomly chosen number, 1-10):
IF OBJECT_ID('dbo.LargeTable') IS NULL
CREATE TABLE dbo.LargeTable (KeyVal INT NOT NULL, DataVal INT NOT NULL);
WITH Digits AS (SELECT d FROM
(VALUES (1),(2),(3),(4),(5),(6),(7),(8),(9),(0)) D(d))
INSERT INTO dbo.LargeTable (KeyVal, DataVal)
SELECT CONVERT(INT, 1 + o.d + (10*t.d) + (100*h.d) + (1000*th.d) +
(10000*tt.d) + (100000*ht.d) + (1000000*m.d)) KeyVal,
CONVERT(INT, CEILING(RAND(CHECKSUM(NEWID())) * 10)) DataVal
FROM Digits o, Digits t, Digits h, Digits th, Digits tt, Digits ht, Digits m;
ALTER TABLE dbo.LargeTable ADD CONSTRAINT PK_LargeTable PRIMARY KEY (KeyVal);
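The cross-join-of-digits trick above is language-agnostic, so here is a quick sketch of it in Python – scaled down to three digit places (1,000 keys) so it runs instantly, where the SQL crosses seven copies of the Digits CTE to reach 10,000,000:

```python
from itertools import product

# Crossing three copies of the digits 0-9 yields every 3-digit combination;
# the ones/tens/hundreds arithmetic turns each combination into a distinct
# key value, exactly as the o/t/h/... aliases do in the SQL above.
digits = range(10)
keys = sorted(1 + o + 10 * t + 100 * h for o, t, h in product(digits, repeat=3))
print(len(keys), keys[0], keys[-1])  # 1000 1 1000
```

Each additional cross join multiplies the row count by ten, which is why seven joins produce exactly ten million rows with no duplicates.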
So far, so good, but if I want to find the most efficient way to work with date values, I'll need some dates:
ALTER TABLE dbo.LargeTable
ADD Date1_N DATE NULL,
Date2_N DATE NULL,
Date1_S INT NULL,
Date2_S INT NULL;
I want some very specific things with these dates. For one, the Date1_N and Date1_S values should relate to the same date (with Date1_S being the difference in days between the system 0 date – 1/1/1900 – and Date1_N). The same goes for Date2_N and Date2_S. I want Date2 to always come after Date1, and neither Date1 nor Date2 should happen in the future (so Date1 might be a birthdate, for instance, and Date2 might be an enrollment date). I don't want an even distribution of rows across the dates – I want some to have significantly more representation than others.
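The natural/surrogate pairing is easy to see outside of SQL. Here's a small Python sketch (the function names are my own, for illustration) of the same idea: the surrogate integer is just the day count from the system zero date, 1900-01-01, and the conversion round-trips cleanly:

```python
from datetime import date, timedelta

# The system zero date that Date1_S and Date2_S count days from.
ZERO = date(1900, 1, 1)

def to_surrogate(d: date) -> int:
    # Natural date -> surrogate integer (days since 1900-01-01).
    return (d - ZERO).days

def to_natural(s: int) -> date:
    # Surrogate integer -> natural date.
    return ZERO + timedelta(days=s)

print(to_surrogate(date(1900, 1, 2)))                 # 1
print(to_natural(to_surrogate(date(1999, 12, 31))))   # 1999-12-31
```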
This presents a number of problems. If all I wanted were random values, I could populate Date1 and Date2 in much the same way I populated DataVal when generating LargeTable – but statistically, with that method I would expect Date2 to come before Date1 about half the time, and both Date1 and Date2 to be evenly represented across the 10,000,000 records. The randomness requirement is also restrictive in its own right – many of the tricks that work with deterministic values will not work with randomly created ones. Several promising solutions ran very quickly but produced a single random date for all ten million rows, because SQL Server evaluated the "random" expression once and reused the result for every row. There was no shortage of slow solutions, but I finally found one that updates the table in under three minutes.
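The "half the time" claim is easy to verify with a quick simulation – here in Python, since the point is statistical rather than SQL-specific (the 45,000-day range is a stand-in for the number of days since 1900):

```python
import random

# Draw two dates independently and uniformly from the same range; the first
# draw plays Date1, the second Date2. Count how often Date2 lands earlier.
random.seed(42)
N = 100_000
before = sum(1 for _ in range(N)
             if random.randrange(45_000) > random.randrange(45_000))
print(before / N)  # roughly 0.5
```

By symmetry the two orderings are equally likely (exact ties are rare at this range size), so independent draws would violate the Date1-before-Date2 requirement on about half the rows.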
First, I need to create a distribution pattern for the results. I'll create a temp table to hold values from the beginning of my date range – 1/1/1900 – through to the end – today's date. I'm going to call the integer value of that date "i". I want to create a random value for the number of records in my large table that will get that value (the date's distribution) – I'll call that "Freq" - and another random value to determine the order in which the dates will be applied – called "o". I'll also create a temp table called #Date to hold the date values when I generate them.
IF OBJECT_ID('tempdb..#Frequency') IS NOT NULL DROP TABLE #Frequency;
IF OBJECT_ID('tempdb..#Date') IS NOT NULL DROP TABLE #Date;
DECLARE @BeginDate DATE = '1/1/1900', @EndDate DATE = GETDATE();
DECLARE @CountDates INT = DATEDIFF(d,@BeginDate,@EndDate);
CREATE TABLE #Frequency (i INT, o INT, Freq INT);
CREATE TABLE #Date (RandomDateID INT,
    Date1_S INT, Date1_N AS (CONVERT(DATE, DATEADD(d, Date1_S, 0))),
    Date2_S INT, Date2_N AS (CONVERT(DATE, DATEADD(d, Date2_S, 0))));
WHILE COALESCE((SELECT SUM(Freq) FROM #Frequency), 0) < 10000000
MERGE #Frequency tgt
USING (SELECT KeyVal i,
              CONVERT(INT, CONVERT(BINARY(3), NEWID())) o,
              CONVERT(INT, RAND(CHECKSUM(NEWID())) * @CountDates / 80) Freq
       FROM dbo.LargeTable
       WHERE KeyVal BETWEEN DATEDIFF(d, 0, @BeginDate) AND DATEDIFF(d, 0, @EndDate)) src
ON tgt.i = src.i
AND tgt.o = src.o
WHEN NOT MATCHED BY TARGET THEN INSERT (i, o, Freq)
VALUES (src.i, src.o, src.Freq);
CREATE UNIQUE CLUSTERED INDEX CL_Frequency ON #Frequency (o, i);
Now I have about 45,000 rows. (As of this writing, there are just under 42,000 days between 1/1/1900 and the present day, but some days will be entered more than once in our frequency distribution, and it's possible that some days won't be represented at all.) The sum of all the values in the Freq column is at least ten million, so we will end up with enough records to populate all ten million rows. Finally, we create an index on the table to speed processing in the following step. Next, we'll create some actual dates and assign each date to a KeyVal by using the ROW_NUMBER windowed function.
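The logic of that WHILE loop is worth seeing stripped of the MERGE machinery. Here's a Python sketch of it (variable names are mine; count_dates plays the role of @CountDates, and /80 is the same shaping divisor the SQL uses):

```python
import random

# Keep inserting one (day, random order, random frequency) row per day
# until the frequencies sum to at least the target row count, mirroring
# WHILE SUM(Freq) < 10000000 ... MERGE.
random.seed(7)
target = 10_000_000
count_dates = 45_000
freq = []  # rows of (i, o, Freq), like #Frequency
while sum(f for _, _, f in freq) < target:
    freq.extend((i, random.getrandbits(24), int(random.random() * count_dates / 80))
                for i in range(count_dates))
total = sum(f for _, _, f in freq)
print(len(freq), total >= target)
```

Because each pass adds one row per day with an average Freq of a few hundred, a single pass usually clears the ten-million threshold; a second pass (and hence duplicate days in the distribution) is the exception, not the rule.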
This second insert is a bit more complicated than the first. Recall that we want the Date2 value to come a random number of days after the Date1 value, but no later than @EndDate. To make that happen, we're going to generate another number using RAND with the CHECKSUM of NEWID() as its seed. RAND returns a float between 0 and 1, so to get a random integer representing a date between two date values (#Frequency.i and @EndDate), I multiply the random value by the difference between i and @EndDate, round the product to the nearest integer, and add the value of i.
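That scale-round-shift mapping is the standard way to turn a unit-interval random number into a bounded integer, and it's easy to check that the result can never escape the [i, @EndDate] range – here sketched in Python (the function name is mine, for illustration):

```python
import random

def random_day_between(i: int, end: int) -> int:
    # A float in [0, 1) scaled by the remaining span, rounded to the
    # nearest integer, then shifted by i -- the same arithmetic as
    # i + ROUND(RAND(...) * (@EndDate - i), 0) in the SQL.
    return i + round(random.random() * (end - i))

random.seed(1)
end = 45_000
samples = [random_day_between(30_000, end) for _ in range(10_000)]
print(min(samples) >= 30_000 and max(samples) <= end)  # True
```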
Finally, we need this to produce a number of distinct records equal to the value found in the Freq column. We'll accomplish this by joining our #Frequency temp table to the LargeTable (using LargeTable's KeyVal field as a catalogue of integers) on the KeyVal value less than or equal to the Freq value.
INSERT INTO #Date (RandomDateID, Date1_S, Date2_S)
SELECT ROW_NUMBER() OVER (ORDER BY RandomOrder) RandomDateID, Date1_S, Date2_S
FROM (SELECT TOP 10000000
             CONVERT(INT, CONVERT(BINARY(3), NEWID())) RandomOrder,
             f.i Date1_S,
             f.i + CONVERT(INT, ROUND(RAND(CHECKSUM(NEWID())) * (@CountDates - f.i), 0)) Date2_S
      FROM #Frequency f
      JOIN dbo.LargeTable t
        ON t.KeyVal <= f.Freq
      ORDER BY o) d;
CREATE UNIQUE CLUSTERED INDEX UCL_Date_RandomDateID ON #Date (RandomDateID);
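The row-multiplying join in that insert is the step that turns each frequency into actual rows, and it deserves a moment's attention. In miniature (Python, with made-up sample values):

```python
# Treating LargeTable.KeyVal as a catalogue of integers 1..N, the join
# condition KeyVal <= Freq emits each frequency row exactly Freq times.
freq_rows = [(0, 3), (1, 1), (2, 2)]   # (day i, Freq) -- sample #Frequency rows
key_vals = range(1, 11)                # stand-in for LargeTable.KeyVal
expanded = [i for i, f in freq_rows for k in key_vals if k <= f]
print(expanded)  # [0, 0, 0, 1, 2, 2]
```

Day 0 (Freq 3) matches KeyVal 1, 2, and 3 and so appears three times, and likewise for the others – the same fan-out the SQL performs across ten million KeyVals.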
As you can see, we need one more step as well – we don't want this distribution to be clumped; we want a single date to show up, on average, only once every 42,000 rows or so. So we create yet another RandomOrder for our date records, then apply the ROW_NUMBER windowed function over the set, ordered by that RandomOrder. Finally, once again, we index our temp table to make the following step perform better. We now have a temp table with 10,000,000 randomized date values, with the second date always falling after the first and both dates falling between the system zero date and the present day. All we have to do now is apply the result:
UPDATE lt
SET Date1_N = d.Date1_N,
    Date2_N = d.Date2_N,
    Date1_S = d.Date1_S,
    Date2_S = d.Date2_S
FROM dbo.LargeTable lt
JOIN #Date d
ON lt.KeyVal = d.RandomDateID;
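The de-clumping trick – tag each row with a random key, sort by it, and number the sorted rows – is a generic shuffle, sketched here in Python with a handful of made-up date pairs standing in for the ten million real ones:

```python
import random

# Random tag + sort + enumerate is the Python analogue of
# ROW_NUMBER() OVER (ORDER BY RandomOrder): each row gets a rank
# that is independent of its date values.
random.seed(3)
rows = [("2001-05-04", "2010-01-01"), ("1953-02-11", "1980-07-23"),
        ("1999-12-31", "2003-03-03"), ("1970-01-01", "1995-06-15")]
tagged = sorted((random.getrandbits(24), r) for r in rows)
numbered = [(rank + 1, r) for rank, (_, r) in enumerate(tagged)]
print(numbered[0][0], numbered[-1][0])  # 1 4
```

Joining those ranks back to KeyVal then scatters each date's occurrences uniformly across the table instead of leaving them in contiguous runs.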
On my local machine, the entire process runs in 2:39 – not bad for creating fields with random values where the value of one column depends on the random value created for another.