Evaluating Text Quality of GPT Engine Davinci-003 and GPT Engine Davinci Generation Using BLEU Score


  • Yayan Heryanto, Faculty of Communication and Information Technology, Universitas Nasional, Jakarta, Indonesia
  • Agung Triayudi, Faculty of Communication and Information Technology, Universitas Nasional, Jakarta, Indonesia, https://orcid.org/0000-0002-1269-5888




Davinci-003, GPT Engine, BLEU Score


Text generation has advanced significantly in natural language processing with the use of Transformer-based language models such as GPT (Generative Pre-trained Transformer). In this study, we evaluate text quality using the BLEU (Bilingual Evaluation Understudy) score for two prominent GPT engines: Davinci-003 and Davinci. We used questions and answers about Python, drawn from internet sources, as input data. Davinci-003 achieved the higher BLEU score, 0.035, while Davinci attained 0.021. In terms of response time, Davinci was faster, with an average of 4.20 seconds, while Davinci-003 averaged 6.59 seconds. The choice between Davinci-003 and Davinci for chatbot development should therefore depend on the specific project requirements. If text quality is the priority, Davinci-003 is the better choice because of its higher BLEU score. If faster response times matter more, Davinci may be the more suitable option.
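To make the comparison concrete, the following is a minimal sketch of how a sentence-level BLEU score can be computed for one generated answer against one reference answer. It is not the authors' evaluation code. It assumes whitespace tokenization, n-grams up to length 4 with uniform weights, a single reference, and no smoothing; the function names and the example strings are hypothetical.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count every contiguous n-gram in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(reference, candidate, max_n=4):
    """Unsmoothed sentence-level BLEU with one reference (illustrative only)."""
    ref = reference.split()
    cand = candidate.split()
    if not cand:
        return 0.0

    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngram_counts(cand, n)
        ref_counts = ngram_counts(ref, n)
        total = sum(cand_counts.values())
        # Clip each candidate n-gram count by its count in the reference.
        clipped = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        if total == 0 or clipped == 0:
            # Without smoothing, any zero precision makes the score zero.
            return 0.0
        log_precisions.append(math.log(clipped / total))

    # Brevity penalty: penalize candidates shorter than the reference.
    if len(cand) > len(ref):
        bp = 1.0
    else:
        bp = math.exp(1 - len(ref) / len(cand))

    # Geometric mean of the n-gram precisions, scaled by the brevity penalty.
    return bp * math.exp(sum(log_precisions) / max_n)


# Hypothetical reference answer and model output, for illustration only.
reference_answer = "a list is a mutable ordered collection of items in Python"
model_answer = "a list is a mutable ordered collection of values"
print(round(bleu(reference_answer, model_answer), 3))
```

Averaging such per-answer scores over the whole question set would give one figure per engine. Whether the study applied smoothing or a corpus-level variant is not stated in the abstract, so the sketch leaves both out.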

