Explanation:
itemsDf.sample(fraction=0.1, seed=87238)
Correct. If itemsDf has 10,000 rows, this code block returns about 1,000 rows. The count is approximate because DataFrame.sample() does not guarantee an exact number of rows. To ensure you are not
returning duplicates, leave the withReplacement parameter at False, which is the default. Since the question specifies that the same rows should be returned even if the code block is run twice,
you need to specify a seed. The value passed as the seed does not matter as long as it is an integer.
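The two properties above (an approximate row count and repeatable results with a fixed seed) can be illustrated in plain Python. This is only a sketch under the assumption that each row is kept independently with probability fraction; bernoulli_sample is a hypothetical helper, not part of PySpark, and it is not Spark's actual implementation.

```python
import random

def bernoulli_sample(rows, fraction, seed=None):
    """Keep each row independently with probability `fraction` (assumed model)."""
    rng = random.Random(seed)  # seed=None draws fresh entropy, so results vary per call
    return [row for row in rows if rng.random() < fraction]

rows = list(range(10_000))  # stand-in for the 10,000 rows of itemsDf

first = bernoulli_sample(rows, fraction=0.1, seed=87238)
second = bernoulli_sample(rows, fraction=0.1, seed=87238)

print(len(first))       # close to 1,000, but usually not exactly 1,000
print(first == second)  # True: same seed, same rows
```

Calling the helper without a seed would typically give a different selection on each run, which is the reason the seedless option further down is rejected.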
itemsDf.sample(withReplacement=True, fraction=0.1, seed=23536)
Incorrect. While this code block fulfills almost all requirements, it may return duplicates. This is because withReplacement is set to True.
Here is how to understand what replacement means: Imagine you have a bucket of 10,000 numbered balls and you need to take 1,000 balls at random from the bucket (similar to the problem in the
question). If you took those balls with replacement, you would take a ball, note its number, and put it back into the bucket, so the next time you took a ball there would be a chance of drawing
the exact same ball again. If you took the balls without replacement, you would leave each drawn ball outside the bucket while you took the remaining balls, so no ball could be drawn twice.
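The bucket analogy can be tried out with Python's standard random module. This is a sketch of the analogy only; it draws exactly 1,000 balls in both cases and does not reproduce how Spark samples internally.

```python
import random

rng = random.Random(23536)
balls = range(1, 10_001)  # balls numbered 1 to 10,000

# Without replacement: each ball can be drawn at most once.
without_replacement = rng.sample(balls, k=1_000)

# With replacement: every draw sees the full bucket again.
with_replacement = rng.choices(balls, k=1_000)

print(len(set(without_replacement)))  # 1000, always
print(len(set(with_replacement)))     # almost certainly fewer than 1000
```

The second count falls short of 1,000 whenever at least one ball was drawn more than once, which is what "may return duplicates" means for the withReplacement=True option.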
itemsDf.sample(fraction=1000, seed=98263)
Wrong. When sampling without replacement (the default), the fraction parameter needs to have a value between 0 and 1. In this case, it should be 0.1, since 1,000/10,000 = 0.1.
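The arithmetic behind the fraction can be written as a small helper. fraction_for is a hypothetical name used for illustration, and the range check mirrors the 0-to-1 requirement for sampling without replacement; it is not a PySpark function.

```python
def fraction_for(target_rows, total_rows):
    """Turn a desired row count into a sampling fraction in [0, 1]."""
    if total_rows <= 0:
        raise ValueError("total_rows must be positive")
    fraction = target_rows / total_rows
    if not 0.0 <= fraction <= 1.0:
        raise ValueError(f"fraction {fraction} is outside [0, 1]")
    return fraction

print(fraction_for(1_000, 10_000))  # 0.1
```

Passing the desired row count itself (1000) instead of this ratio is the mistake made in the option above.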
itemsDf.sampleBy("row", fractions={0: 0.1}, seed=82371)
No, DataFrame.sampleBy() is meant for stratified sampling. This means that, based on the values in one column of a DataFrame, you can draw a separate fraction of the rows for each of those values
(more details linked below). In the scenario at hand, sampleBy is not the right method to use because the question does not name any column that the sampling should depend on.
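To make stratified sampling concrete, here is a plain-Python sketch. sample_by and the "category" column are hypothetical and only mimic the idea: each row is kept with the probability assigned to its stratum, and strata missing from fractions are assumed to get probability 0.

```python
import random

def sample_by(rows, key, fractions, seed=None):
    """Keep each row with the probability assigned to its stratum.

    Strata missing from `fractions` get probability 0 here (an assumption).
    """
    rng = random.Random(seed)
    return [row for row in rows if rng.random() < fractions.get(row[key], 0.0)]

# Hypothetical data: a "category" column with three strata of 3,000 rows each.
rows = [{"id": i, "category": i % 3} for i in range(9_000)]

sampled = sample_by(rows, "category", fractions={0: 0.1, 1: 0.5}, seed=82371)

counts = {c: sum(1 for r in sampled if r["category"] == c) for c in (0, 1, 2)}
print(counts)  # roughly 300 rows for 0, roughly 1,500 for 1, and none for 2
```

Under the same assumption, a fractions argument of {0: 0.1} would keep rows only from stratum 0, so it would not give a 10% sample of the whole DataFrame.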
itemsDf.sample(fraction=0.1)
Incorrect. This code block checks all the boxes except that it does not ensure that when you run it a second time, the exact same rows will be returned. In order to achieve this, you would have to
specify a seed.
More info:
- pyspark.sql.DataFrame.sample — PySpark 3.1.2 documentation
- pyspark.sql.DataFrame.sampleBy — PySpark 3.1.2 documentation
- Types of Samplings in PySpark 3. The explanations of the sampling… | by Pinar Ersoy | Towards Data Science