├── LICENSE
├── README.md
├── assets
    └── overview.png
├── environment.yml
└── src
    ├── datasets
        ├── confaide.csv
        ├── pku-rlhf-10k.csv
        ├── sst2.csv
        ├── stereoset.csv
        ├── toxigen.csv
        ├── truthfulqa_test.csv
        └── truthfulqa_train.csv
    ├── eval_trustworthiness.py
    ├── generate_activations.py
    ├── generate_steering_vector.py
    ├── mi_estimators.py
    ├── prompt_template.py
    ├── scripts
        ├── probing.sh
        └── steering.sh
    └── train_probes.py

/LICENSE:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ChnQ/TracingLLM/HEAD/LICENSE
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ChnQ/TracingLLM/HEAD/README.md
--------------------------------------------------------------------------------
/assets/overview.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ChnQ/TracingLLM/HEAD/assets/overview.png
--------------------------------------------------------------------------------
/environment.yml:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ChnQ/TracingLLM/HEAD/environment.yml
--------------------------------------------------------------------------------
/src/datasets/confaide.csv:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ChnQ/TracingLLM/HEAD/src/datasets/confaide.csv
--------------------------------------------------------------------------------
/src/datasets/pku-rlhf-10k.csv:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ChnQ/TracingLLM/HEAD/src/datasets/pku-rlhf-10k.csv
--------------------------------------------------------------------------------
/src/datasets/sst2.csv:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ChnQ/TracingLLM/HEAD/src/datasets/sst2.csv
--------------------------------------------------------------------------------
/src/datasets/stereoset.csv:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ChnQ/TracingLLM/HEAD/src/datasets/stereoset.csv
--------------------------------------------------------------------------------
/src/datasets/toxigen.csv:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ChnQ/TracingLLM/HEAD/src/datasets/toxigen.csv
--------------------------------------------------------------------------------
/src/datasets/truthfulqa_test.csv:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ChnQ/TracingLLM/HEAD/src/datasets/truthfulqa_test.csv
--------------------------------------------------------------------------------
/src/datasets/truthfulqa_train.csv:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ChnQ/TracingLLM/HEAD/src/datasets/truthfulqa_train.csv
--------------------------------------------------------------------------------
/src/eval_trustworthiness.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ChnQ/TracingLLM/HEAD/src/eval_trustworthiness.py
--------------------------------------------------------------------------------
/src/generate_activations.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ChnQ/TracingLLM/HEAD/src/generate_activations.py
--------------------------------------------------------------------------------
/src/generate_steering_vector.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ChnQ/TracingLLM/HEAD/src/generate_steering_vector.py
--------------------------------------------------------------------------------
/src/mi_estimators.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ChnQ/TracingLLM/HEAD/src/mi_estimators.py
--------------------------------------------------------------------------------
/src/prompt_template.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ChnQ/TracingLLM/HEAD/src/prompt_template.py
--------------------------------------------------------------------------------
/src/scripts/probing.sh:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ChnQ/TracingLLM/HEAD/src/scripts/probing.sh
--------------------------------------------------------------------------------
/src/scripts/steering.sh:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ChnQ/TracingLLM/HEAD/src/scripts/steering.sh
--------------------------------------------------------------------------------
/src/train_probes.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ChnQ/TracingLLM/HEAD/src/train_probes.py
--------------------------------------------------------------------------------