Experiment hardware: M1 MacBook Pro with 32 GB of RAM
```shell
git clone --recursive https://github.com/mxpoliakov/MisSynth.git && cd MisSynth
export PYTHONPATH=$(pwd):$(pwd)/missci
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```

Create a JSON vector store based on scraped articles (web, PDF) from the MISSCI dev split. All 30 articles were scraped and vectorized using `NeuML/pubmedbert-base-embeddings` with a chunk size of 512 and a chunk overlap of 64.
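Conceptually, the chunking step works like this. A minimal character-based sketch; the actual `create_vector_store.py` may chunk on tokens instead:

```python
# Sketch only: character-based chunking with the stated size/overlap settings.
def chunk_text(text: str, chunk_size: int = 512, overlap: int = 64) -> list[str]:
    """Split text into overlapping chunks; consecutive chunks share `overlap` characters."""
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break
    return chunks
```

Each chunk is then embedded with the embedding model and stored alongside its source-article metadata, so retrieval can later be restricted to a single article.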
```shell
python create_vector_store.py
```

Generate synthetic fallacies using the single-class prompt template. The vector store is used to retrieve relevant article excerpts to support the argument claim, essentially functioning as a lightweight RAG with metadata filtering. The OpenAI o4-mini model generates 30 synthetic fallacies per sample from the MISSCI dev split. Each fallacy includes both a fallacious premise and its context.
Additionally, 15 synthetic claim–accurate premise pairs with real fallacies are generated for each entry in the dev split, using the synthetic claim-premise template.
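The retrieval step described above amounts to metadata filtering followed by similarity ranking. A minimal sketch, assuming a store entry schema of `article_id` / `embedding` / `text` (the repo's actual format may differ):

```python
import math

def cosine(a, b):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(store, query_vec, article_id, top_k=3):
    """Metadata filtering first (restrict to one article), then rank by similarity."""
    candidates = [e for e in store if e["article_id"] == article_id]
    candidates.sort(key=lambda e: cosine(e["embedding"], query_vec), reverse=True)
    return [e["text"] for e in candidates[:top_k]]
```

The metadata filter is what keeps the retrieved excerpts tied to the article the claim misrepresents, rather than the whole corpus.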
```shell
export OPENAI_API_KEY=...
python generate_synthetic_data.py --prompt-template single-class-synthetic-fallacy-context --n-synthetic-entries 30
python generate_synthetic_data.py --prompt-template synthetic-claim-premise --n-synthetic-entries 15
```

You can also create and analyze a unified JSONL dataset (stored in the `dataset` folder) via:
```shell
python create_unified_dataset.py
python analyze_synthetic_dataset.py
```

Create a fine-tuning dataset from the raw data produced in the previous step. For the baseline experiment, we classify fallacies given the premise, using the classify-with-definition template. With the generated synthetic fallacies, we can fill out the template and provide target responses to fine-tune the LLM. Let's fine-tune Microsoft's Phi-4 on the synthetic fallacies.
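Filling the template turns each synthetic fallacy into a prompt–response pair in chat format, which is what an instruction-tuned model expects. A hypothetical sketch; the template wording and field names below are assumptions, not the repo's actual `create_fine_tuning_dataset.py`:

```python
import json

# Assumed template and field names, for illustration only.
TEMPLATE = (
    "Claim: {claim}\n"
    "Premise: {premise}\n"
    "Given the fallacy definitions, which fallacy class applies?"
)

def to_chat_example(entry: dict) -> dict:
    """Fill the classification template and attach the gold label as the response."""
    prompt = TEMPLATE.format(claim=entry["claim"], premise=entry["fallacious_premise"])
    return {"messages": [
        {"role": "user", "content": prompt},
        {"role": "assistant", "content": entry["fallacy_class"]},
    ]}

def write_jsonl(entries, path):
    # One JSON object per line, the usual format for LoRA training data.
    with open(path, "w") as f:
        for e in entries:
            f.write(json.dumps(to_chat_example(e)) + "\n")
```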
```shell
python create_fine_tuning_dataset.py
python -m mlx_lm lora --model mlx-community/phi-4-8bit --data output \
    --train --fine-tune-type lora --batch-size 1 --num-layers 16 --iters 500 --adapter-path adapters
```

Benchmark on the MISSCI test split to avoid data leakage:
```shell
python run_mlx_fallacy_classification.py --model-name phi-4-8bit
python run_mlx_fallacy_classification.py --model-name phi-4-8bit --adapter-path adapters
cd missci
python run-fallacy-classification-with-gold-premise.py parse-llm-output phi-4-8bit_cls_with_premise_classify-D_test.jsonl
python run-fallacy-classification-with-gold-premise.py parse-llm-output phi-4-8bit_cls_with_premise_classify-D_test_adapters.jsonl
```

| Model | Vanilla acc | Vanilla F1 | Fine-tune acc | Fine-tune F1 | LoRA layers | Params |
|---|---|---|---|---|---|---|
| LLaMA 2 | 0.577 (*) | 0.464 (*) | - | - | - | 70B |
| Phi-4 (8-bit) | 0.667 | 0.550 | 0.762 | 0.690 | 16 | 15B |
(*) Table 3 from *MISSCI: Reconstructing Fallacies in Misrepresented Science*.
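The reported numbers are standard classification accuracy and, assuming macro averaging (per-class F1 averaged uniformly), F1 can be computed as:

```python
def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores (macro averaging is an assumption here)."""
    f1s = []
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)
```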
To test whether the pipeline generalizes beyond the MISSCI domain, we evaluate the best-performing fine-tuned model (LLaMA 3.1 4-bit, highest F1 on MISSCI) on external fallacy datasets. The adapters come from fine-tuning on MISSCI synthetic data only; no external samples are used during training. Each dataset has its own conversion script that maps the source fallacy classes to the MISSCI 9-class taxonomy and produces a JSONL file compatible with the same classification pipeline.
The MAFALDA dataset (NAACL 2024) contains 200 span-annotated text samples with 23 fallacy classes.
The conversion script maps 7 MAFALDA classes to 6 MISSCI classes:
| MAFALDA class | MISSCI class |
|---|---|
| equivocation | Ambiguity |
| causal oversimplification | Causal Oversimplification |
| false causality | Causal Oversimplification |
| false dilemma | False Dilemma / Affirming the Disjunct |
| hasty generalization | Hasty Generalization |
| false analogy | False Equivalence |
| fallacy of division | Fallacy of Division/Composition |
16 MAFALDA classes have no MISSCI equivalent (mostly emotion- and credibility-based fallacies) and are skipped. 3 MISSCI classes (Impossible Expectations, Biased Sample Fallacy, Fallacy of Exclusion) have no MAFALDA counterpart.
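The mapping and filtering can be sketched as follows. The dictionary is taken from the table above; the input field names are assumptions, and the real logic lives in `create_mafalda_dataset.py`:

```python
# Class mapping from the table above (MAFALDA -> MISSCI).
MAFALDA_TO_MISSCI = {
    "equivocation": "Ambiguity",
    "causal oversimplification": "Causal Oversimplification",
    "false causality": "Causal Oversimplification",
    "false dilemma": "False Dilemma / Affirming the Disjunct",
    "hasty generalization": "Hasty Generalization",
    "false analogy": "False Equivalence",
    "fallacy of division": "Fallacy of Division/Composition",
}

def convert(samples):
    """Keep only samples whose MAFALDA label maps into the MISSCI taxonomy."""
    out = []
    for s in samples:  # field names "text" / "fallacy" are assumptions
        missci_class = MAFALDA_TO_MISSCI.get(s["fallacy"].lower())
        if missci_class is not None:
            out.append({"text": s["text"], "fallacy_class": missci_class})
    return out
```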
```shell
python create_mafalda_dataset.py
```

This produces `dataset/mafalda.test.jsonl` (103 evaluation entries):
```shell
python run_mlx_fallacy_classification.py --model-name Llama-3.1-8B-Instruct-4bit --dataset-path dataset/mafalda.test.jsonl
python run_mlx_fallacy_classification.py --model-name Llama-3.1-8B-Instruct-4bit --dataset-path dataset/mafalda.test.jsonl --adapter-path adapters
```

| Model | Vanilla acc | Vanilla F1 | Fine-tune acc | Fine-tune F1 | LoRA layers | Params |
|---|---|---|---|---|---|---|
| LLaMA 3.1 (4-bit) | 0.087 | 0.075 | 0.222 | 0.301 | 32 | 8B |
The Logic dataset (EMNLP 2022 Findings) contains sentence-level fallacy annotations across 13 classes from both educational (`edu_*.csv`) and climate-related (`climate_*.csv`) sources.
The conversion script maps 6 Logic classes to 5 MISSCI classes:
| Logic class | MISSCI class |
|---|---|
| equivocation | Ambiguity |
| circular reasoning | Ambiguity |
| false causality | Causal Oversimplification |
| false dilemma | False Dilemma / Affirming the Disjunct |
| faulty generalization | Hasty Generalization |
| fallacy of logic | False Equivalence |
7 Logic classes have no MISSCI equivalent (ad hominem, ad populum, appeal to emotion, fallacy of credibility, fallacy of extension, fallacy of relevance, intentional) and are skipped. 4 MISSCI classes (Biased Sample Fallacy, Fallacy of Division/Composition, Fallacy of Exclusion, Impossible Expectations) have no Logic counterpart.
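Since the Logic sources are CSV files, the conversion can be sketched as a row-by-row mapping. The dictionary is taken from the table above; the CSV column names are assumptions, and the real logic lives in `create_logic_dataset.py`:

```python
import csv
import io

# Class mapping from the table above (Logic -> MISSCI).
LOGIC_TO_MISSCI = {
    "equivocation": "Ambiguity",
    "circular reasoning": "Ambiguity",
    "false causality": "Causal Oversimplification",
    "false dilemma": "False Dilemma / Affirming the Disjunct",
    "faulty generalization": "Hasty Generalization",
    "fallacy of logic": "False Equivalence",
}

def convert_rows(reader):
    """Map Logic CSV rows to MISSCI-style entries, skipping unmapped classes.
    Column names ("source_article", "logical_fallacies") are assumptions."""
    out = []
    for row in reader:
        missci_class = LOGIC_TO_MISSCI.get(row["logical_fallacies"].strip().lower())
        if missci_class is not None:
            out.append({"text": row["source_article"], "fallacy_class": missci_class})
    return out

# In-memory example; real use would iterate over the edu_*.csv / climate_*.csv files.
sample = "source_article,logical_fallacies\nAll swans are white.,faulty generalization\n"
entries = convert_rows(csv.DictReader(io.StringIO(sample)))
```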
```shell
python create_logic_dataset.py
```

This produces `dataset/logic.test.jsonl` (1547 evaluation entries):
```shell
python run_mlx_fallacy_classification.py --model-name Llama-3.1-8B-Instruct-4bit --dataset-path dataset/logic.test.jsonl
python run_mlx_fallacy_classification.py --model-name Llama-3.1-8B-Instruct-4bit --dataset-path dataset/logic.test.jsonl --adapter-path adapters
```

| Model | Vanilla acc | Vanilla F1 | Fine-tune acc | Fine-tune F1 | LoRA layers | Params |
|---|---|---|---|---|---|---|
| LLaMA 3.1 (4-bit) | 0.178 | 0.175 | 0.269 | 0.278 | 16 | 8B |
