Sebastián Sarasti - Data Scientist

The data come from the following Kaggle dataset.
The dataset contains roughly 400k samples, but only a random 10% sample was used to train the models, because there were not enough computing resources available to transform the full dataset.

Special characters were removed from the text.
The text was then tokenized, and all tokens were lowercased to obtain a uniform vocabulary.
Next, stopwords and tokens shorter than 3 characters were removed.
The remaining tokens were then stemmed to reduce the vocabulary size.
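
The cleaning steps can be sketched as follows. This is a minimal illustration using only the standard library, not the project's actual code: the `STOPWORDS` set, the `toy_stem` suffix stripper, and the `clean_text` function are hypothetical stand-ins, and a real run would more likely use a full stopword list and an established stemmer such as Porter's.

```python
import re

# Hypothetical, deliberately tiny stopword list (assumption for illustration).
STOPWORDS = {"the", "and", "was", "were", "that", "this", "with", "for", "are", "have"}

# Suffixes removed by the toy stemmer below; a stand-in for a real stemmer.
SUFFIXES = ("ing", "ed", "es", "s")


def toy_stem(token: str) -> str:
    """Strip the first matching suffix, keeping at least 3 characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def clean_text(text: str) -> list[str]:
    # 1. Replace every character that is not a letter or whitespace with a space.
    text = re.sub(r"[^A-Za-z\s]", " ", text)
    # 2. Tokenize on whitespace and lowercase every token.
    tokens = [t.lower() for t in text.split()]
    # 3. Drop stopwords and tokens shorter than 3 characters.
    tokens = [t for t in tokens if t not in STOPWORDS and len(t) >= 3]
    # 4. Stem what is left to shrink the vocabulary.
    return [toy_stem(t) for t in tokens]


print(clean_text("The models were trained, and it WORKED!"))
```

Filtering before stemming keeps the stopword lookup on unmodified words, which matches the order of the steps described above.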
Finally, the TF-IDF matrix was computed from the cleaned data and reduced to 5 dimensions with Principal Component Analysis (PCA).
70% of the data was used for training and the remaining 30% for testing.
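
A rough sketch of the vectorization, reduction, and split with scikit-learn is shown below. The toy corpus, the label encoding (1 for AI, 0 for human), and the `random_state` values are assumptions made for illustration. The sketch follows the order described above (TF-IDF and PCA first, then the split); fitting the vectorizer and PCA on the training portion only would be an alternative that avoids leaking test information.

```python
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Hypothetical toy corpus of already-cleaned texts (assumption).
texts = [
    "model generat fluent essay about climate",
    "student wrote short note about summer trip",
    "language model produc structured summary",
    "author describ childhood memory with detail",
    "assistant generat formal report about energy",
    "friend shar funny story from school",
    "system produc balanced argument about policy",
    "writer recall rainy morning near river",
    "chatbot generat polite answer about history",
    "teacher wrote personal letter about travel",
]
labels = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0]  # assumed encoding: 1 = AI, 0 = human

# TF-IDF matrix over the cleaned texts (sparse, one row per document).
tfidf = TfidfVectorizer().fit_transform(texts)

# Reduce to 5 dimensions; the matrix is densified first, which is fine for a
# small sample but may need a sparse-friendly method on larger data.
features = PCA(n_components=5, random_state=0).fit_transform(tfidf.toarray())

# 70% training, 30% testing, keeping the class balance in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.3, random_state=0, stratify=labels
)

print(features.shape, len(X_train), len(X_test))
```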

Several models were built and compared to find the best performer: Random Forest classifiers, XGBoost classifiers, logistic regressions, and k-NN classifiers.
All models were logged with MLflow, and the results were saved in a DagsHub repository.
The metrics used to evaluate model performance were accuracy, precision, recall, and F1-score.
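
For reference, the four metrics can be computed from predictions as in the sketch below. The function name `classification_metrics` and the example labels are hypothetical; in practice a library such as scikit-learn would typically provide these values. Precision, recall, and F1-score are computed per class, which is why a separate F1-score can be reported for each class.

```python
def classification_metrics(y_true, y_pred, positive):
    """Accuracy plus precision, recall, and F1 for the class `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Made-up labels for illustration only; these are not the project's results.
y_true = ["ai", "ai", "ai", "human", "human", "human", "human", "human"]
y_pred = ["ai", "ai", "human", "human", "human", "human", "human", "ai"]

print(classification_metrics(y_true, y_pred, positive="ai"))
print(classification_metrics(y_true, y_pred, positive="human"))
```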
The architecture used to deploy this model is shown in the following picture:

The best model was the Random Forest, which achieved 0.92 accuracy, a 0.88 F1-score for the AI class, and a 0.93 F1-score for the human class.
The master branch shows how the code was developed, while the deploy branch contains the code for the Streamlit app. The public code is available in the following GitHub repo.
The MLflow experiments are available in the following DagsHub repository.

Detection of Text Written by Artificial Intelligence (AI)