Sebastián Sarasti - Data Scientist

The data used in this study is sourced from the following Kaggle dataset.
The dataset comprises approximately 60,000 samples, categorized into three labels: HQ (High-quality posts without any edits), LQ_EDIT (Low-quality posts with a negative score and multiple community edits, though they remain open after these changes), and LQ_CLOSE (Low-quality posts closed by the community without any edits).

Special characters were removed from the text using regular expressions.
Following that, categories were transformed into labels suitable for the model.
The text was then tokenized to produce the inputs the model requires.
Subsequently, the data was saved in a dataset object to facilitate its use with PyTorch models.
This process was applied to both the training and test datasets.
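
The cleaning, label-encoding, and dataset steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the regular expression, the `LABEL_TO_ID` ordering, and the `QuestionDataset` class name are all assumptions, and tokenization is left out because it depends on the FNet tokenizer that is loaded.

```python
import re

# Hypothetical label order: the source does not say which integer each category gets.
LABEL_TO_ID = {"HQ": 0, "LQ_EDIT": 1, "LQ_CLOSE": 2}


def clean_text(text: str) -> str:
    """Replace every character that is not a letter, digit, or whitespace
    with a space, then collapse runs of whitespace (assumed cleaning rule)."""
    text = re.sub(r"[^A-Za-z0-9\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()


class QuestionDataset:
    """Map-style dataset: it defines __len__ and __getitem__, which is the
    protocol torch.utils.data.DataLoader expects from a dataset object."""

    def __init__(self, texts, categories):
        self.texts = [clean_text(t) for t in texts]
        self.labels = [LABEL_TO_ID[c] for c in categories]

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        return {"text": self.texts[idx], "label": self.labels[idx]}


# The same class would be instantiated once for training and once for test data.
train_set = QuestionDataset(
    ["<p>How do I sort a list?</p>", "plz fix my code!!!"],
    ["HQ", "LQ_CLOSE"],
)
```

In a full pipeline, `__getitem__` would more likely return token IDs as tensors rather than raw text. Plain Python values are used here only to keep the sketch free of dependencies.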

A model was constructed using transfer learning based on the FNet architecture. FNet is a neural network that employs Fourier transformations to replace the self-attention mechanism found in transformer architectures.
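
To make the idea concrete, the Fourier mixing step that stands in for self-attention can be sketched in a few lines of PyTorch. This only illustrates the mixing operation (a Fourier transform along the hidden dimension, then along the sequence dimension, keeping the real part). It is not the pretrained model, and the function name `fourier_mixing` and the tensor sizes are chosen here for illustration.

```python
import torch


def fourier_mixing(hidden: torch.Tensor) -> torch.Tensor:
    """Mix tokens without attention.

    hidden has shape (batch, seq_len, hidden_dim). An FFT is applied along
    the hidden dimension, then along the sequence dimension, and only the
    real part of the result is kept.
    """
    return torch.fft.fft(torch.fft.fft(hidden, dim=-1), dim=-2).real


# Illustrative sizes only: 2 sequences, 8 tokens each, 16 features per token.
x = torch.randn(2, 8, 16)
mixed = fourier_mixing(x)  # same shape as x, real-valued
```

The operation has no learnable parameters, so the learned weights of an FNet block sit in its feed-forward sublayers rather than in the mixing step.
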
The weights of the FNet architecture were obtained from the Hugging Face repository.
The final hidden state of FNet was flattened and passed through a Sequential model whose last layer has three neurons, each giving the probability of one label.
The architecture utilized to build this model can be seen in the following picture:
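
In code, a head of the kind described above might look like the sketch below. The layer sizes, the sequence length, and the single hidden layer are assumptions made for illustration, since the actual layout is defined in the project's notebooks. A random tensor stands in for the FNet output so that the example runs without downloading weights.

```python
import torch
from torch import nn

# Assumed sizes: SEQ_LEN and the 256-unit hidden layer are hypothetical;
# HIDDEN = 768 is assumed to match a base-sized FNet checkpoint.
SEQ_LEN, HIDDEN, NUM_LABELS = 32, 768, 3

head = nn.Sequential(
    nn.Flatten(),                      # (batch, seq, hidden) -> (batch, seq * hidden)
    nn.Linear(SEQ_LEN * HIDDEN, 256),
    nn.ReLU(),
    nn.Linear(256, NUM_LABELS),        # one output per label
)

# Stand-in for the encoder's final hidden state for a batch of 4 questions.
last_hidden_state = torch.randn(4, SEQ_LEN, HIDDEN)

logits = head(last_hidden_state)
probs = torch.softmax(logits, dim=-1)  # each row sums to 1 across HQ, LQ_EDIT, LQ_CLOSE
```

When training with `nn.CrossEntropyLoss`, the raw `logits` would be passed to the loss and the softmax applied only at inference time, because that loss applies log-softmax internally.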

The model reached an accuracy of 0.79 after just three epochs in which only the Sequential layers were trained.
The notebooks folder on the master branch shows how the model was trained. The code is publicly available in the following GitHub repo.
The trained model weights can be found in the following Hugging Face repository.

Stack Overflow Questions Quality Rating