Creating Your Own Datasets for NLP Projects Using the Faker Library

Most NLP problems need a large dataset to train a model, but relatively few suitable datasets are freely available online. The Faker library helps fill this gap by letting us generate our own datasets.

Topics covered in this article:

  1. Faker Library and its functionality
  2. Create our Dataset

Faker Library

1. Install Faker with pip

pip install Faker

2. Import the Faker library

from faker import Faker
# Initialize a Faker generator and assign it to a variable
fake = Faker()

3. Exploring the functionality of the Faker Library

Let's generate a name, phone number, email address, and postal address using Faker:

print("Name:", fake.name())
print("Phone Number:", fake.phone_number())
print("Email:", fake.email())
print("Address:", fake.address())

Faker can generate random data in many languages (for example, Spanish, Hindi, Japanese, and French) by passing a locale code when creating the generator. The Faker documentation lists the supported locales.

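faker_hi = Faker(['hi_IN'])  # Hindi (India) locale
for i in range(2):
    # Create random text using the Faker library
    print(faker_hi.text())
faker_es = Faker(['es_ES'])  # Spanish (Spain) locale
for i in range(2):
    print(faker_es.text())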

Creating a Custom Dataset Using Profile

Let’s generate fake profiles with the Faker library and load them into a pandas DataFrame.

The following code creates a dataset of 100 fake profiles.

import pandas as pd  # Import pandas
from faker import Faker  # Import Faker
fake = Faker()  # Initialize the Faker generator
profiles = [fake.profile() for _ in range(100)]  # Create 100 fake profiles
df = pd.DataFrame(profiles)  # Convert the profiles into a DataFrame
df.head()  # Preview the first five rows

For more functionality, refer to the Faker documentation.

In this article, we explored the functionality of the Faker library and created a dataset of fake profiles. The same approach can be used to generate datasets in many languages for ML and NLP projects.
